Feature selection

Definition

Feature selection is the process of selecting the set of features of a given input datum that we will use to predict the corresponding output value. For instance, a set of features that we may use to predict the price of a house may be the number of floors, floor area, zip code, size of front porch, and number of windows.

Distinction between elementary feature selection and derived feature selection

Elementary features are features that cannot be deduced from other, simpler features already available. Derived features are features that can be computed from features that have already been included. The choice of derived features is better thought of as a model selection problem than a feature selection problem, because derived features can be incorporated into the functional form rather than treated as separate features. Therefore, this page concentrates on the selection of elementary features.
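As a minimal sketch of the distinction, using the hypothetical house-price features from the definition above (the specific numbers are illustrative): derived features are generated inside the modeling step from elementary features that have already been gathered, so they require no additional data collection.

```python
import numpy as np

# Elementary features, gathered directly from the data source
# (columns: number of floors, floor area in square meters).
X = np.array([
    [1, 80.0],
    [2, 150.0],
    [3, 200.0],
])

# Derived features are computed from the elementary ones as part of the
# functional form; no additional data collection is needed.
floors = X[:, 0]
area = X[:, 1]
derived = np.column_stack([
    area / floors,   # average area per floor
    area ** 2,       # quadratic term in floor area
])

X_with_derived = np.column_stack([X, derived])
print(X_with_derived)
```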

Considerations in feature selection

Predictive power

The goal of feature selection is to choose features that help with predicting the output. Note that the prediction of the output is done by the features in combination, rather than by any one feature individually, but a quick-and-dirty analysis of the sensitivity of the output to individual features can be a good rule of thumb for determining whether to include those features.

One approach might be to use a very simple predictive model (such as univariate linear regression or univariate logistic regression) to see how the output varies with a single feature, then use $R^2$ or another goodness-of-fit measure to determine whether the feature is worth including. However, this approach might miss features that tend to be predictive only when combined with other features.
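As a rough sketch of such univariate screening (the synthetic data and the selection threshold below are illustrative assumptions): fit a one-feature linear regression per candidate feature and rank the features by $R^2$, which for univariate linear regression equals the squared Pearson correlation between the feature and the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the output depends on features 0 and 1; feature 2 is noise.
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def univariate_r2(x, y):
    """R^2 of a one-feature linear regression, i.e. the squared
    Pearson correlation between the feature and the output."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

scores = [univariate_r2(X[:, j], y) for j in range(X.shape[1])]
for j, score in enumerate(scores):
    print(f"feature {j}: R^2 = {score:.3f}")

# Keep features whose univariate R^2 clears an (arbitrary) threshold.
selected = [j for j, score in enumerate(scores) if score > 0.05]
print("selected features:", selected)
```

As noted above, a feature that matters only through its interaction with other features would score poorly here, so this kind of screening is best treated as a first-pass filter rather than a final decision.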

Cost of data-gathering for training examples

A supervised learning algorithm that fits the parameters of the functional form expressing the output in terms of the features needs access to the values of the features for all the training data (semi-supervised algorithms can make do with the values for a subset of the training data). The cost of gathering this data can be a constraint in deciding whether or not to include the feature.

In addition to the cost, the precision and accuracy of the feature value as measured also matter.

Cost of data-gathering for new instances where the value needs to be predicted

In order for a feature to be helpful for a predictive model, its value should be easy to compute for new instances where we are trying to make predictions. Therefore, features whose values are known only after a time lag (relative to when the prediction needs to be made) are not useful for predictive models, even if we have access to past data and can build a retrospectively predictive model using them.

The predictive power may be constrained by the choice of features, regardless of the power of models or learning algorithms

Once the set of features is chosen, that puts an upper bound on just how predictive the model can be. Suppose, for instance, that the output that we are trying to predict is a sum $x_1 + x_2 + \dots + x_n$ of independent features, where each $x_i$ is normally distributed with mean $\mu_i$ and standard deviation $\sigma_i$. If we choose only $x_1$ and $x_2$ as our features, then the most we can say about the output is that it is normally distributed with mean $x_1 + x_2 + \mu_3 + \dots + \mu_n$ and standard deviation $\sqrt{\sigma_3^2 + \dots + \sigma_n^2}$. We simply cannot get more precise.
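A minimal simulation of this bound (the number of features, means, and standard deviations below are illustrative assumptions): even the best possible predictor that sees only $x_1$ and $x_2$ has residual standard deviation no smaller than $\sqrt{\sigma_3^2 + \dots + \sigma_n^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for x_1, ..., x_5 (means and standard deviations).
mu = np.array([1.0, 2.0, 0.5, -1.0, 3.0])
sigma = np.array([1.0, 0.5, 2.0, 1.5, 0.8])

n = 200_000
X = rng.normal(loc=mu, scale=sigma, size=(n, 5))
y = X.sum(axis=1)

# Best possible prediction given only x_1 and x_2:
# x_1 + x_2 plus the means of the unobserved features.
y_hat = X[:, 0] + X[:, 1] + mu[2:].sum()

residual_std = np.std(y - y_hat)
theoretical_floor = np.sqrt((sigma[2:] ** 2).sum())

print(f"empirical residual std:  {residual_std:.3f}")
print(f"theoretical lower bound: {theoretical_floor:.3f}")
```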

Selecting too many features could be problematic

The following are the problems with selecting too many features:

  • Collecting information on the values of the features for all the training data, as well as for the new inputs on which we are trying to make predictions, becomes harder.
  • The learning algorithm used to optimize the parameter values becomes more computationally intensive.
  • A functional form that uses too many features may suffer from overfitting (see the sketch after this list).
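A rough illustration of the overfitting point above (the synthetic data, sample sizes, and feature counts are illustrative assumptions): as irrelevant features are added, an ordinary least-squares fit matches the training data ever more closely while its error on held-out data grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 2 of many candidate features matter.
n_train, n_test, n_features = 40, 1000, 35
X = rng.normal(size=(n_train + n_test, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

def fit_and_evaluate(k):
    """Least-squares fit using the first k features (plus an intercept)."""
    A_train = np.column_stack([np.ones(n_train), X_train[:, :k]])
    A_test = np.column_stack([np.ones(n_test), X_test[:, :k]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((A_train @ coef - y_train) ** 2)
    test_mse = np.mean((A_test @ coef - y_test) ** 2)
    return train_mse, test_mse

for k in (2, 10, 35):
    train_mse, test_mse = fit_and_evaluate(k)
    print(f"{k:2d} features: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```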

See also

  • Feature scaling: This is a process of linear scaling typically carried out after feature selection, and in some cases along with model selection. The idea is to make the ranges of values typically taken by the features roughly comparable, so that some features do not take very large values while others take very small values, something that poses a problem for some learning algorithms and also for regularization (a sketch follows at the end of this list).
  • Model selection
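A minimal sketch of feature scaling via standardization, one common form of such linear scaling (the feature values are illustrative):

```python
import numpy as np

# Illustrative feature matrix with very different ranges:
# number of floors (1-3) versus floor area in square meters (80-200).
X = np.array([
    [1, 80.0],
    [2, 150.0],
    [3, 200.0],
])

# Standardize each feature (column) to zero mean and unit variance,
# so the features take values on roughly comparable scales.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```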