Machine learning

==Definition==

Machine learning can be defined as the problem of using partial data about a function or relation to make predictions about how the function or relation would perform on new data.

===The case of learning a function===

Although this particular formulation is not the most natural formulation of all machine learning problems, any machine learning problem can be converted to this formulation.

An output function depends on some inputs (called [[feature]]s), and possibly on a random noise component. We have access to some training data where we have the value of the output function, and we either already have or can get the values of the features. The machine learning problem asks for an explicit description of a function that would perform well on new inputs that we provide to it.

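To make the formulation concrete, here is a minimal sketch in Python with NumPy. The underlying function, the noise level, and the choice of a linear model are assumptions made up for this example, not part of the definition.

<syntaxhighlight lang="python">
# Minimal sketch of learning a function from training data (NumPy only).
# The "true" function, noise level, and linear model are assumptions
# invented for this example.
import numpy as np

rng = np.random.default_rng(0)

# Unknown output function: depends on one feature, plus random noise.
x_train = rng.uniform(0, 1, size=50)
y_train = 3.0 * x_train + 1.0 + rng.normal(scale=0.1, size=50)

# Learn an explicit description of the function -- here a line y = w*x + b,
# chosen by least squares on the training data.
A = np.column_stack([x_train, np.ones_like(x_train)])
w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]

# The learned description can now be evaluated on new inputs.
x_new = np.array([0.25, 0.75])
print(w * x_new + b)  # predictions for inputs not in the training data
</syntaxhighlight>
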
==Types of machine learning==
 
===Classification based on nature of training data===
 
{| class="sortable" border="1"
! Type of machine learning !! Description
|-
| [[supervised learning]] || Explicit training data (input-output pairs) are provided. These can be used to learn the parameters for the functional form that can then be used to make predictions.
|-
| [[unsupervised learning]] || The training data does not include explicit outputs for the inputs. For instance, in the context of classification problems, an unsupervised learning problem would simply provide many inputs without specifying the output values for those inputs. The job of the machine learning algorithm is to use the distribution of the inputs to infer the output values (see the sketch below this table).
|-
| [[semi-supervised learning]] || The training data is a mix of explicit input-output pairs and input-only data. Semi-supervised learning combines some of the aspects of supervised learning and unsupervised learning.
|}
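
As a sketch of the contrast between the first two rows, the following Python example (assuming the scikit-learn library is available, with invented two-cluster toy data) trains a supervised classifier on input-output pairs and an unsupervised clustering algorithm on the inputs alone.

<syntaxhighlight lang="python">
# Minimal sketch contrasting supervised and unsupervised learning.
# Assumes scikit-learn is available; the two-cluster toy data is invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two clouds of points in the plane, one per class.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # explicit output values

# Supervised: explicit input-output pairs (X, y) are used for training.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.0, 0.0], [4.0, 4.0]]))

# Unsupervised: only the inputs X are given; the algorithm uses their
# distribution to assign output values (cluster labels) on its own.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels[:3], labels[-3:])
</syntaxhighlight>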
 
===Classification based on nature of prediction===


{| class="sortable" border="1"
! Type of prediction problem !! Description
|-
| binary classification problem || Here, the output has only two permissible values, so the prediction problem can be viewed as a yes/no or true/false prediction problem. Models for binary classification can be probabilistic (such as [[logistic regression]] or [[artificial neural network]]s) or yes/no (such as [[support vector machine]]s). Note that probabilistic binary classification models structurally resemble regression models (see the sketch below this table).
|-
| discrete variable prediction problem, or multi-class classification problem || Here, the output can take one of a finite number of values. We can also think of this as a classification problem where each input case needs to be sorted into one of finitely many classes.
|-
| regression problem, or continuous variable prediction problem || Here, the output can take a value over a continuous range of values.
|}
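
The following Python sketch (again assuming scikit-learn, with invented one-feature data) illustrates the difference between binary classification and regression, and why a probabilistic classifier structurally resembles a regression model.

<syntaxhighlight lang="python">
# Minimal sketch of binary classification versus regression.
# Assumes scikit-learn; the one-feature toy data is invented.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))

# Binary classification: the output takes one of two values. A probabilistic
# model such as logistic regression returns P(y = 1 | x), which is why it
# structurally resembles a regression model.
y_binary = (X[:, 0] > 0.5).astype(int)
clf = LogisticRegression().fit(X, y_binary)
print(clf.predict_proba([[0.2], [0.9]]))  # class probabilities
print(clf.predict([[0.2], [0.9]]))        # hard yes/no labels

# Regression: the output ranges over a continuum.
y_cont = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=100)
reg = LinearRegression().fit(X, y_cont)
print(reg.predict([[0.2], [0.9]]))        # continuous predictions
</syntaxhighlight>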
 
===Classification based on the stage of learning===
 
{| class="sortable" border="1"
! Stage of learning !! Description
|-
| [[eager learning]] || This is the more common form of learning. Here, the training data is pre-processed and an explicit, compact representation of the function is learned from it (usually in the form of a parameter vector). The learning phase that uses the training data to learn the compact representation takes substantially more time, but once this phase is done, the training data can be thrown away. The compact representation takes much less memory than the training data and can be used to quickly compute the output for any input.
|-
| [[lazy learning]] || With this form of learning, the training data is not completely pre-processed. Every time a new prediction needs to be made, the training data is used to make the prediction. Lazy learning algorithms are useful in cases where the prediction of function values is based on nearby values.
|-
| [[online learning]] || Here, the input data is streamed to the algorithm, one instance at a time. For each data point, three steps occur: (a) the algorithm reads the input, (b) the algorithm makes a prediction of the output, (c) the algorithm learns the actual output and updates its parameters accordingly (see the sketch below this table).
|}
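
The following Python sketch (assuming scikit-learn for the lazy learner, with invented data and learning rate) illustrates lazy learning via nearest neighbors and the read-predict-update loop of online learning.

<syntaxhighlight lang="python">
# Minimal sketch of lazy and online learning. Assumes scikit-learn for the
# lazy learner; the data and learning rate are invented for the example.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Lazy learning: fit() essentially just stores the training data; each
# prediction is computed at query time from the nearby stored points.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.predict([[0.25]]))

# Online learning, one instance at a time: (a) read the input,
# (b) predict, (c) observe the true output and update the parameters.
w, b, lr = 0.0, 0.0, 0.1
for x_i, y_i in zip(X[:, 0], y):
    y_hat = w * x_i + b   # (b) predict with current parameters
    err = y_hat - y_i     # (c) compare with the actual output
    w -= lr * err * x_i   # gradient step on the squared error
    b -= lr * err
print(w, b)  # should approach the slope 2.0 and intercept 0.0
</syntaxhighlight>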
