This article focuses on gradient descent as it is used in machine learning. For more on gradient descent from a purely mathematical perspective, see gradient descent on the calculus wiki

## Definition

Gradient descent is a general-purpose first-order iterative algorithm used to find minima of functions (generally, convex functions). It is used in the machine learning context to solve the problem of finding the choice of parameter vector that minimizes a (possibly regularized) cost function for a training set. Gradient descent is iterative and therefore an anytime algorithm: it can be stopped at any stage and we would have a guess for the parameter vector, with the guess getting better the more iterations we allow.

As a general rule, gradient descent has linear convergence, like most first-order methods. Accelerated gradient methods and second-order iterative methods can converge substantially faster.

## Challenges to designing a gradient descent algorithm

### Determining the learning rate

Gradient descent could use a constant learning rate, an exponentially decaying learning rate, or not have a predetermined learning rate at all but rather use one-dimensional optimization along the gradient direction to determine the optimal step size at all. Line search-based methods cost more per iteration but also usually give greater improvements in the cost function per iteration. How the added cost per iteration trades off against the greater improvement per iteration is an empirical question.