Learning algorithm hyperparameter
A learning algorithm hyperparameter is a hyperparameter that controls the execution of the learning algorithm, i.e., different choices of the hyperparameter could affect the time taken, the resources consumed, and the nature of the final parameter vector output by the learning algorithm.
We could take a narrow view or a broad view of the scope of the learning algorithm. Under the narrow view, the goal of the learning algorithm is simply to solve a calculus optimization problem: minimize the (possibly regularized) cost function value on the training set, without worrying about the generalization error, because that is something to be decided at a different level (namely, the selection of the regularization hyperparameter). Under the broad view, the learning algorithm is directly responsible for trying to achieve low generalization error, as measured by error on a withheld cross-validation set or test set.
Note that even with the narrow view, it may still be beneficial to stop well short of convergence, due to time and resource constraints. With the broad view, another reason for stopping short of convergence is that beyond a certain point, improving performance (i.e., reducing error) on the training set does not reduce generalization error much, and might even increase it.
Learning rate hyperparameters
A learning rate hyperparameter controls how aggressively the algorithm updates the parameter vector at each iteration. For most iterative algorithms, the following hold (see the sketch after this list for a concrete illustration):
- Learning rates that are too high will cause the updates to overshoot and diverge, so the algorithm will not converge to the optimum.
- Learning rates that are at or slightly below the optimal learning rate will lead to substantially faster convergence than other choices. In fact, the exactly optimal learning rate can lead to a higher order of convergence than other learning rates, or even to complete convergence in finitely many steps.
- Learning rates that are substantially lower than the optimal learning rate will result in very slow convergence, but the algorithm will still converge.
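As a concrete illustration, consider gradient descent on the one-dimensional quadratic f(w) = 0.5 * a * w^2, whose gradient is a * w. The minimal sketch below (in Python; the constants are illustrative, not taken from any particular source) shows all three behaviors: the learning rate 1/a reaches the minimum in one step, a much smaller learning rate converges slowly, and a learning rate above 2/a diverges.

    def gradient_descent(learning_rate, a=2.0, w0=1.0, num_iterations=20):
        # Each update scales w by (1 - learning_rate * a), since the gradient
        # of f(w) = 0.5 * a * w**2 is a * w.
        w = w0
        for _ in range(num_iterations):
            w = w - learning_rate * a * w
        return w

    # For a = 2.0: the optimal rate 0.5 converges in one step, 0.01 converges
    # very slowly, and 1.5 (above 2/a = 1.0) diverges.
    for learning_rate in (0.5, 0.01, 1.5):
        print(learning_rate, gradient_descent(learning_rate))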
Stopping rule hyperparameters
A stopping rule or termination rule hyperparameter is a hyperparameter that controls when the learning algorithm stops. Stopping rules could be of many kinds:
Number of iterations
The number of iterations for the iterative algorithm: This number could itself be a hyperparameter, or we could have a formula to calculate the number of iterations based on aspects of the problem structure, with some hyperparameters appearing within the formula.
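As a minimal sketch of the second option (the formula and its constant here are hypothetical, purely for illustration), the iteration budget could be computed from the problem size, with the scaling constant acting as the hyperparameter inside the formula:

    import math

    def num_iterations(num_examples, num_features, budget_constant=10.0):
        # Hypothetical formula: scale the iteration budget with the logarithm
        # of the problem size; budget_constant is the hyperparameter.
        return int(budget_constant * math.log(num_examples * num_features + 1))

    print(num_iterations(num_examples=10000, num_features=50))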
For an algorithm that is generally guaranteed to converge, it is true at least in expectation that the larger the number of iterations, the lower the error on the training set. In fact, for many deterministic (full-batch) learning algorithms, the training error is monotonically non-increasing in the number of iterations for any fixed choice of initial parameter vector. However, the relation between the number of iterations and generalization error is theoretically more ambiguous: there are cases where running for too many iterations worsens overfitting severely enough that generalization error starts increasing.
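One common way to pick the number of iterations in light of this (often called early stopping; the function name and the patience parameter below are illustrative assumptions) is to track error on a withheld validation set and stop once it has failed to improve for a while:

    def best_iteration(validation_errors, patience=5):
        # validation_errors: error on a withheld validation set, one value per
        # iteration. Stop once no improvement on the best error has been seen
        # for `patience` consecutive iterations; `patience` is a hyperparameter.
        best_error = float("inf")
        best_iter = 0
        for i, err in enumerate(validation_errors):
            if err < best_error:
                best_error, best_iter = err, i
            elif i - best_iter >= patience:
                break
        return best_iter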
Nature of cost function as a function of the number of iterations
Another way of deciding when to stop is to examine how the cost function value changes with the number of iterations, and to stop once recent declines have become small compared with earlier declines. If we additionally know the theory of how the algorithm converges, we can use the cost function values observed so far to predict (approximately) the optimal value, and therefore to know how far off from the optimal value we currently are.
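A minimal sketch of such a rule (the window size and threshold are assumed names for illustration): stop once the decline over the most recent window of iterations is a small fraction of the decline over the first window. Both the window size and the relative threshold are stopping rule hyperparameters.

    def should_stop(cost_history, window=10, relative_threshold=0.01):
        # cost_history: cost function values recorded so far, one per iteration.
        if len(cost_history) < 2 * window:
            return False
        early_decline = cost_history[0] - cost_history[window - 1]
        recent_decline = cost_history[-window] - cost_history[-1]
        # Stop if recent progress is at most 1% (by default) of early progress.
        return recent_decline <= relative_threshold * max(early_decline, 1e-12)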
Nature of changes to parameter vector in the domain
While the above focuses on how the cost function is changing, we could also consider the extent to which the parameter vector for the model is changing, and wait until those changes are small enough.
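A minimal sketch, with the tolerance as the stopping rule hyperparameter (the Euclidean norm is just one choice; the choice of norm could itself be treated as a hyperparameter):

    import numpy as np

    def parameters_converged(w_old, w_new, tolerance=1e-6):
        # Stop once the parameter vector moves by less than the tolerance
        # between consecutive iterations.
        return np.linalg.norm(w_new - w_old) < tolerance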
Subsampling hyperparameters
In some cases, we want to execute part or all of the learning algorithm using a small subsample of the examples (row subsampling) or a small subsample of the features (column subsampling) or both. Row subsampling is generally done when the examples are sufficiently similar that a small enough sample conveys enough information about the full set of examples. Column subsampling is generally done when some of the columns (features) are much more informative than others, so restricting to those still gives a reasonably good model.
Subsampling may be done for part of the algorithm, after which we switch to using the full sample.
Hyperparameters associated with subsampling include the following (a sketch follows the list):
- Choice of the fraction of examples or features that are to be subsampled
- Choice of the criteria by which examples and features are selected for subsampling
- Choice of the stage at which to switch from the subsample to the full sample, or from a smaller subsample (fewer examples and features) to a larger one.
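A minimal sketch of row and column subsampling (the function and parameter names are assumptions; selection here is uniformly at random, though, as noted above, the selection criterion is itself a hyperparameter choice):

    import numpy as np

    def subsample(X, y, row_fraction=0.1, column_fraction=0.5, seed=0):
        # X: (num_examples, num_features) feature matrix; y: labels.
        rng = np.random.default_rng(seed)
        num_rows, num_cols = X.shape
        row_idx = rng.choice(num_rows, size=max(1, int(row_fraction * num_rows)), replace=False)
        col_idx = rng.choice(num_cols, size=max(1, int(column_fraction * num_cols)), replace=False)
        return X[np.ix_(row_idx, col_idx)], y[row_idx], col_idx

A training procedure might run its early iterations on the subsample and then switch to the full data; the switch-over point is the third hyperparameter listed above.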
Approximation hyperparameters
In some cases, the computations can be made faster and more scalable through approximations. There could be hyperparameters associated with how precise we keep the approximations.
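For example (a minimal sketch; the use of a truncated singular value decomposition here is just one illustration), a feature matrix could be replaced by a low-rank approximation, with the retained rank serving as the approximation hyperparameter that trades precision for speed:

    import numpy as np

    def low_rank_approximation(X, rank=10):
        # Keep only the top `rank` singular values/vectors; a larger rank gives
        # a more precise approximation at greater computational cost.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]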