Supervised learning

Definition

The term supervised learning describes the subclass of machine learning problems where we are given a set of labeled examples for training and use that data to learn a function that takes any new example and predicts its label. There are two main types of supervised learning techniques: classification (where the label is a discrete category) and regression (where the label is a continuous value).

Here, the term "example" refers to the input of the function (the information used to make the prediction), and the term "label" refers to the output of the function (the value to be predicted).
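As a minimal illustration of the two types, consider the sketch below. It uses scikit-learn, and the tiny datasets are invented; neither is prescribed by this article.

    # A minimal sketch of both kinds of supervised learning, using scikit-learn.
    # The data below is invented purely for illustration.
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Regression: examples are (area in square feet,), labels are prices (numbers).
    areas = [[800], [1200], [1500], [2000]]
    prices = [160_000, 240_000, 290_000, 400_000]
    reg = LinearRegression().fit(areas, prices)
    print(reg.predict([[1700]]))  # predicted price for a new, unlabeled example

    # Classification: examples are (hours studied,), labels are discrete (fail=0, pass=1).
    hours = [[1], [2], [4], [5], [7], [8]]
    passed = [0, 0, 0, 1, 1, 1]
    clf = LogisticRegression().fit(hours, passed)
    print(clf.predict([[6]]))  # predicted class label for a new example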

Steps of supervised learning

Supervised learning is a process that goes through several steps, which are presented in the table below; a code sketch after the table ties the steps together.

Aspect: Feature selection
What gets chosen here: The set of features that the model depends on.
Description: Based on the problem domain, we come up with a list of relevant features that affect the output. If we choose too few features, the task may be impossible even in principle: for instance, if the only feature we have for a house is its area, and we need to predict its price, we cannot predict very well. In principle, more features improve our ability to predict; however, more features mean more effort spent collecting their values, and they also raise the danger of overfitting.

Aspect: Model class selection (not to be confused with hyperparameter optimization)
What gets chosen here: The functional form (with parameters) describing how the output depends on the features.
Description: This depends on theoretical knowledge of the problem domain, as well as on empirical exploration of the data gathered.

Aspect: Cost function selection
What gets chosen here: The cost function (or error function) used to measure error on new data.
Description: This depends on theoretical knowledge of the problem domain, and also on the choice of model class. Often, cost function selection is bundled with model class selection, because part of the model class selection process also involves identifying the nature of the distribution of errors or anomalies. The cost function has the property that if we choose parameters so that our predicted function matches the actual function exactly, the error is zero. In principle, however, the cost function is independent of the model class, so we can combine any permissible model class with any permissible cost function.

Aspect: Regularization-type choices
What gets chosen here: The regularization function to add to the cost function when training on the data.
Description: This requires choosing regularization hyperparameter(s).

Aspect: Learning algorithm applied on the training data
What gets chosen here: The values of the parameters (not uniquely determined; we might get a portfolio of choices for different hyperparameter choices).
Description: This is the algorithm that tries to solve the optimization problem of choosing values of the parameters (for our chosen model) so that the cost function is minimized, or close to minimized. Note that what we ultimately want is to minimize error on unknown inputs, but the algorithm is trained on known inputs; this mismatch gives rise to overfitting. The problem is addressed through a number of techniques, including regularization and early stopping.

Aspect: Cross-validation (relates to hyperparameter optimization)
What gets chosen here: The actual values of the hyperparameters and parameters.
Description: This includes techniques that tune the hyperparameters controlling the performance of the learning algorithm. These could include the learning rate for gradient descent, or regularization parameters introduced to avoid overfitting.
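To make the table concrete, here is an end-to-end sketch on an invented house-price regression task. It uses scikit-learn; the features, synthetic data, model class, and hyperparameter grid are all illustrative assumptions, not anything this article prescribes.

    # An end-to-end sketch of the steps in the table, on an invented
    # house-price regression task. All concrete choices below (features,
    # model, grid values) are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    # Feature selection: we pick area (sq ft) and number of bedrooms.
    n = 200
    area = rng.uniform(500, 3000, n)
    bedrooms = rng.integers(1, 6, n)
    X = np.column_stack([area, bedrooms])
    # Synthetic labels: price depends on both features, plus noise.
    y = 150 * area + 20_000 * bedrooms + rng.normal(0, 20_000, n)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Model class selection: a linear model, price = w1*area + w2*bedrooms + b.
    # Cost function selection: squared error.
    # Regularization-type choice: L2 penalty (Ridge); alpha is its hyperparameter.
    model = Ridge()

    # Cross-validation / hyperparameter optimization: pick alpha by 5-fold CV.
    search = GridSearchCV(model, {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)

    # Learning algorithm applied on the training data: fit() solves the
    # regularized least-squares optimization for the parameters w1, w2, b.
    search.fit(X_train, y_train)

    print("chosen alpha:", search.best_params_["alpha"])
    print("test MSE:", mean_squared_error(y_test, search.predict(X_test)))

Note that GridSearchCV refits the best model on the whole training set, so the hyperparameter (alpha) and the parameters (the linear coefficients) end up being chosen together, matching the last row of the table.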