Learning curve for number of iterations

Definition

The learning curve for number of iterations is a particular type of learning curve used to gauge the performance of iterative learning algorithms in machine learning problems. The picture typically includes two or three curves in the same graph as follows:

The horizontal axis plots the number of iterations. This does not vary continuosly, but rather varies discretely (it can take nonnegative integer values).
The vertical axis plots the cost function values. Typically, two curves are drawn: the curve for the cost function on the training set, and the curve for the cost function on the cross-validation set. Note that because the number of iterations is only allowed to be a nonnegative integer, we don't actually get a curve. Rather, we get a discrete set of points, but we construct a curve by joining the points for visual ease.

For the cost function on the training set, we may plot either the cost function including regularization, or the cost function excluding regularization. For the cross-validation set, we generally plot the cost function excluding regularization. That's because the purpose of regularization is to improve generalizability, and ultimately the error is measured using the unregularized function.

Learning curve on the training set

Monotonicity

The learning curves should generally be downward-sloping: the cost function should go down with more iterations. However, whether it is monotonic or not depends on the choice of learning algorithm. There are two complications:

Some learning algorithms are not monotonic: there are occasional increases in the cost function even though the long-run trend is downward.
Even if the learning algorithm is monotonic, the unregularized cost function need not go down monotonically.

Asymptotic value

The following constraints are imposed:

There is a theoretical minimum value that the cost function can take. Regardless of the choice of learning algorithm or the number of iterations, this provides a lower bound on the cost function we can obtain by applying the learning algorithm.
If the cost function is non-convex, then there may be local minima distinct from the absolute minimum. Learning algorithms that use local methods would get stuck at such local minima, and depending on the initial vector, they could converge to a local minimum distinct from the global minimum. Whatever local minimum an algorithm converges to based on the initial parameter vector choice puts a lower bound on the cost function regardless of the number of iterations.
Some learning algorithms don't converge to a local minimum at all -- they reduce the cost function somewhat, but then the cost function stabilizes at or approaches a value other than the local minimum. This is an undesirable attribute of learning algorithms, but the learning algorithm may have other desirable attributes, such as a fast initial drop in the cost function, or low memory requirements or high parallelizability, that make them desirable.

Convergence rate

Fill this in later

Learning curve on the cross-validation set (or test set)

Altough the remarks here are framed in the language of an explicitly designated cross-validation set, they may be applied directly to the (holdout) test set.

Monotonicity and rebound

The behavior on the cross-validation set is heavily dependent on the attributes of the model and the choice of learning algorithm. Here are some typical possibilities:

The cost function is monotonically decreasing for the cross-validation set, and the asymptotic value for the cross-validation set coincides with that for the training set (although convergence for the cross-validation set may be somewhat slower). This is a situation of no overfitting).
The cost function monotonically decreases for the cross-validation set, but the asymptotic value on the cross-validation set is higher than on the training set. Note that this problem can be fixed somewhat using regularization.
The cost function decreases for the cross-validation set but then starts increasing again. This is an attribute of some learning algorithms where the bulk of the learning for initial iterations is concentrated in generalizable attributes, but these get exhausted quickly, and later iterations end up learning mostly non-generalizable attributes. A remedy for this is early stopping: we need to stop before the cross-validation error starts going up.