Learning curve for number of iterations
Definition
The learning curve for number of iterations is a particular type of learning curve used to gauge the performance of iterative learning algorithms in machine learning problems. The plot typically includes two or three curves on the same pair of axes, structured as follows:
- The horizontal axis plots the number of iterations. This quantity does not vary continuously; it takes only nonnegative integer values.
- The vertical axis plots the value of the cost function. Typically two curves are drawn: one for the cost function on the training set and one for the cost function on the cross-validation set. Because the number of iterations takes only nonnegative integer values, we do not actually obtain a curve; rather, we obtain a discrete set of points, which are joined into a curve for visual ease.
For the training set, we may plot either the cost function including the regularization term or the cost function excluding it. For the cross-validation set, we generally plot the cost function excluding regularization: the purpose of regularization is to improve generalizability, and the error we ultimately care about is measured by the unregularized cost function.
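To make the construction concrete, here is a minimal sketch of how such a plot might be generated. It is not part of the original description: the choice of L2-regularized linear regression trained by batch gradient descent, and all names and settings (X_train, y_train, X_cv, y_cv, lam, alpha), are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def unregularized_cost(theta, X, y):
    # Mean squared error, without the regularization term.
    residuals = X @ theta - y
    return 0.5 * np.mean(residuals ** 2)

def regularized_cost(theta, X, y, lam):
    # Training objective: MSE plus an L2 penalty (bias term excluded).
    return unregularized_cost(theta, X, y) + 0.5 * lam * np.sum(theta[1:] ** 2)

def learning_curves(X_train, y_train, X_cv, y_cv, lam=0.1, alpha=0.01, num_iters=200):
    m, n = X_train.shape
    theta = np.zeros(n)
    train_costs, cv_costs = [], []
    for _ in range(num_iters):
        # One gradient-descent step on the regularized training objective.
        grad = X_train.T @ (X_train @ theta - y_train) / m
        grad[1:] += lam * theta[1:]
        theta -= alpha * grad
        # Record the regularized cost on the training set and the
        # unregularized cost on the cross-validation set.
        train_costs.append(regularized_cost(theta, X_train, y_train, lam))
        cv_costs.append(unregularized_cost(theta, X_cv, y_cv))
    return train_costs, cv_costs

# Example usage with synthetic data; the points are joined into curves for visual ease.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)
train_costs, cv_costs = learning_curves(X[:70], y[:70], X[70:], y[70:])
plt.plot(train_costs, label="training cost (regularized)")
plt.plot(cv_costs, label="cross-validation cost (unregularized)")
plt.xlabel("number of iterations")
plt.ylabel("cost")
plt.legend()
plt.show()
```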
Learning curve on the training set
Monotonicity
The learning curve should generally be downward-sloping: the cost function should decrease as the number of iterations grows. Whether the decrease is monotonic, however, depends on the choice of learning algorithm. There are two complications:
- Some learning algorithms are not monotonic: there are occasional increases in the cost function even though the long-run trend is downward (see the sketch after this list).
- Even if the learning algorithm decreases its training objective monotonically, that objective is typically the regularized cost; the unregularized cost function need not go down monotonically, because the data-fit term and the regularization penalty can trade off against each other from one iteration to the next.
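As an illustration of the first point, here is a sketch (an assumed example, not from the article) of stochastic gradient descent on a least-squares problem. With a small constant step size, batch gradient descent decreases the cost monotonically, whereas the full-dataset cost recorded along a stochastic trajectory typically shows occasional increases even though the long-run trend is downward. All data and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.5, size=200)

def cost(theta):
    # Full-dataset mean squared error.
    return 0.5 * np.mean((X @ theta - y) ** 2)

def sgd_trace(alpha=0.05, num_iters=300, batch_size=5):
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(num_iters):
        # Gradient estimated from a small random mini-batch.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size
        theta -= alpha * grad
        costs.append(cost(theta))  # record the full-dataset cost each iteration
    return np.array(costs)

costs = sgd_trace()
increases = np.sum(np.diff(costs) > 0)
print(f"final cost: {costs[-1]:.4f}; iterations where the cost went up: {increases}")
```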
Asymptotic value
Several factors constrain the value that the learning curve approaches as the number of iterations grows:
- The cost function has a theoretical minimum value. This value is a lower bound on the cost we can obtain, regardless of the choice of learning algorithm or the number of iterations.
- If the cost function is non-convex, there may be local minima distinct from the global minimum. Learning algorithms that use local methods can get stuck at such local minima: depending on the initial parameter vector, they may converge to a local minimum distinct from the global minimum. Whichever local minimum the algorithm converges to then serves as the lower bound on the cost, regardless of the number of iterations (see the sketch after this list).
- Some learning algorithms don't converge to a local minimum at all: they reduce the cost function somewhat, but the cost function then stabilizes at, or approaches, a value other than a local minimum. This is an undesirable attribute in itself, but such algorithms may still be attractive for other reasons, such as a fast initial drop in the cost function, low memory requirements, or high parallelizability.
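The following toy sketch (an assumed example with a hypothetical one-dimensional non-convex cost, not from the article) illustrates the second point: gradient descent converges to different local minima depending on the initial value, and whichever local minimum is reached determines the asymptotic value of the learning curve.

```python
import numpy as np

def f(x):
    # Hypothetical non-convex cost with two distinct local minima.
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad_f(x):
    return 4 * x * (x ** 2 - 1) + 0.3

def descend(x0, alpha=0.01, num_iters=1000):
    x = x0
    curve = []
    for _ in range(num_iters):
        x -= alpha * grad_f(x)
        curve.append(f(x))  # cost value at each iteration
    return x, curve

for x0 in (-2.0, 2.0):
    x_final, curve = descend(x0)
    # The asymptotic value of the curve is the cost at whichever local minimum
    # the initialization led to, not necessarily the global minimum.
    print(f"start {x0:+.1f} -> x = {x_final:+.3f}, asymptotic cost = {curve[-1]:.4f}")
```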
Convergence rate
Fill this in later