Machine learning

==Definition==

'''Machine learning''' can be defined as the problem of using partial data about a function or relation to make predictions about how the function or relation would perform on new data.

==The case of learning a function==

Although this particular formulation is not the most natural one for all machine learning problems, any machine learning problem can be converted to this formulation.

An output function depends on some inputs (called features), and possibly on a random noise component. We have access to some training data where we have the value of the output function, and we either already have or can get the values of the features. The machine learning problem asks for an explicit description of a function that would perform well on new inputs that we provide to it.
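
The following is a minimal sketch of this setup, using the house-price example from the table below. The specific numbers, the two features (area and number of rooms), and the linear functional form are all illustrative assumptions, not prescribed by the text.

<syntaxhighlight lang="python">
# Sketch of the function-learning setup: fit a function to training data,
# then apply it to new inputs. All data values here are made up.
import numpy as np

# Training data: feature values (area in square feet, number of rooms)
# and the observed output (price) for houses we have already seen.
X_train = np.array([[1000.0, 2.0], [1500.0, 3.0], [2000.0, 3.0], [2500.0, 4.0]])
y_train = np.array([200_000.0, 270_000.0, 340_000.0, 410_000.0])

# Choose an explicit functional form, price = w . features + b, and fit
# its parameters to the training data by least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])  # append a bias column
params, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(features):
    """The learned function, usable on new inputs we have never seen."""
    return np.append(features, 1.0) @ params

print(predict(np.array([1800.0, 3.0])))  # predicted price for a new house
</syntaxhighlight>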

==Aspects of machine learning==

{| class="sortable" border="1"
! Aspect !! What gets chosen here !! Description
|-
| [[Feature selection]] || The set of features that the model depends on. || Based on the problem domain, we come up with a list of relevant features that affect the output function. If we choose too few features, the task might be theoretically impossible: for instance, if the only feature we have for a house is its area, and we need to predict the price, we cannot do the prediction very well. The more features we have, the better our ability to predict in principle. However, more features mean more effort spent collecting their values, and there are also dangers of [[overfitting]].
|-
| [[Model selection]] (not to be confused with [[hyperparameter optimization]]) || The functional form (with [[parameter]]s) describing how the output depends on the features. || This depends on theoretical knowledge of the problem domain, as well as empirical exploration of the data gathered.
|-
| [[Cost function selection]] || The cost function (or error function) used to measure error on new data. || This depends on theoretical knowledge of the problem domain, and also on the choice of model. Often, cost function selection is bundled with model selection, because part of the model selection process includes identifying the nature of the distribution of errors or anomalies. The cost function has the property that if we choose parameters so that our predicted function matches the actual function precisely, the error is zero. In principle, however, the cost function is independent of the model, so we can combine any permissible model with any permissible cost function.
|-
| Regularization-type choices || The regularization function to add to the cost function when training on the data. Requires choosing [[regularization hyperparameter]](s). || A worked sketch combining regularization with a learning algorithm appears after this table.
|-
| [[Learning algorithm]] || The algorithm that solves the optimization problem of choosing values of the parameters (for our chosen model) so that the cost function is minimized (or close to minimized). || Note that we want to minimize the error on unknown inputs, but the algorithm is trained on known inputs. Thus, there are issues of [[overfitting]]. This problem is addressed through a number of techniques, including [[regularization]] and [[early stopping]]. See the sketch after this table.
|-
| [[Hyperparameter optimization]] for the learning algorithm || The hyperparameters that control the performance of the learning algorithm. || These could include learning-rate parameters for gradient descent, or regularization parameters introduced to avoid overfitting. A sketch of selecting a hyperparameter on held-out data follows the table.
|}
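
As a worked sketch of how the model, cost function, regularization, and learning algorithm fit together, the following trains a linear model by gradient descent on a squared-error cost with an L2 regularization term. The synthetic data, the regularization strength <code>lam</code>, and the learning rate <code>lr</code> are all illustrative assumptions.

<syntaxhighlight lang="python">
# Model: linear, y = X @ w.  Cost: mean squared error on the training data,
# plus an L2 regularization term.  Learning algorithm: gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # feature values
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # output with random noise

lam = 0.1        # regularization hyperparameter
lr = 0.05        # learning-rate hyperparameter of the learning algorithm
w = np.zeros(3)  # parameters of the chosen model

for _ in range(500):
    residual = X @ w - y
    # Gradient of (1/n) * ||X w - y||^2 + lam * ||w||^2 with respect to w.
    grad = 2.0 * X.T @ residual / len(y) + 2.0 * lam * w
    w -= lr * grad

print(w)  # close to true_w, shrunk slightly toward zero by regularization
</syntaxhighlight>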
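
For the last row of the table, hyperparameters cannot be chosen by minimizing training error (weaker regularization always fits the training data at least as well), so a common approach is to pick the value that does best on held-out validation data. The grid of candidate values below, and the use of closed-form ridge regression as the learning algorithm, are illustrative assumptions.

<syntaxhighlight lang="python">
# Hyperparameter optimization sketch: pick the regularization strength
# that gives the smallest error on held-out validation data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.5 * rng.normal(size=120)

# Hold out part of the data so error is measured on inputs not trained on.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def ridge_fit(X, y, lam):
    # Closed-form minimizer of (1/n) * ||X w - y||^2 + lam * ||w||^2.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

best_lam, best_err = None, np.inf
for lam in [0.0, 0.01, 0.1, 1.0]:            # candidate hyperparameter values
    w = ridge_fit(X_tr, y_tr, lam)
    err = np.mean((X_val @ w - y_val) ** 2)  # validation error
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
</syntaxhighlight>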