Continuous encoding of categorical variables trick

From Machinelearning
Revision as of 19:13, 15 August 2014 by Vipul (talk | contribs)

Definition

The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable v that can take any value from a fixed finite set S.

For each value in S, define a binary feature that is 1 if v equals that value and 0 otherwise. We get a total of |S| features (i.e., as many features as the size of S) corresponding to the different elements of S. This collection of features is called a feature family. Moreover, because the categorical variable v takes exactly one value within S, for any example, exactly one feature in the family takes the value 1 and all the rest take the value 0.
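The encoding described above can be sketched in a few lines of Python. The function name, set S, and example value are illustrative, not from the article:

```python
def one_hot_encode(value, categories):
    """Return the feature family for one categorical variable:
    a list of |S| binary features, 1 where value matches, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

S = ["red", "green", "blue"]           # the fixed finite set S
features = one_hot_encode("green", S)  # -> [0, 1, 0]
```

Note that exactly one entry of the result is 1, matching the observation above.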

If the majority of features in our data matrix are constructed from categorical variables using this trick, then the data matrix will be a sparse matrix: each feature family contributes exactly one nonzero entry per example, regardless of the family's size.
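To illustrate the sparsity claim, here is a small sketch that builds a data matrix from two hypothetical categorical variables (the variables and example rows are invented for illustration):

```python
def one_hot(value, categories):
    # 1 where value matches, 0 elsewhere
    return [1 if value == c else 0 for c in categories]

# Two hypothetical categorical variables and a few example rows.
colors = ["red", "green", "blue"]
sizes = ["S", "M", "L", "XL"]
examples = [("red", "M"), ("blue", "S"), ("green", "XL")]

# Each row concatenates one feature family per categorical variable.
matrix = [one_hot(c, colors) + one_hot(s, sizes) for c, s in examples]

total = sum(len(row) for row in matrix)           # 3 rows x 7 columns = 21
nonzero = sum(1 for row in matrix for x in row if x)  # 2 per row = 6
```

Each row has exactly 2 nonzero entries out of 7 (one per feature family), and the fraction of nonzeros shrinks further as the sets of possible values grow.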

Note that this continuous encoding allows us to consider the following variations:

  • Examples where the value of the categorical variable is unknown: In this case, we set all the features in the family to 0.
  • Examples where the categorical variable is multi-valued: In this case, we set the feature to 1 for each of the values it takes.
  • Examples where there is probabilistically quantified uncertainty as to the value of the categorical variable: In this case, we set each feature to the probability that the corresponding value is the correct one.
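The three variations above can be handled by a single encoder. This is a sketch under assumed conventions (None for unknown, a set for multi-valued, a dict for a probability distribution); none of these conventions come from the article:

```python
def encode(categories, assignment):
    """Encode one categorical variable, covering the three variations:
    - assignment is None: value unknown -> all zeros
    - assignment is a set: multi-valued -> 1 for each listed value
    - assignment is a dict: probability of each value being correct
    - otherwise: a single known value -> ordinary one-hot encoding
    """
    if assignment is None:
        return [0.0] * len(categories)
    if isinstance(assignment, set):
        return [1.0 if c in assignment else 0.0 for c in categories]
    if isinstance(assignment, dict):
        return [float(assignment.get(c, 0.0)) for c in categories]
    return [1.0 if c == assignment else 0.0 for c in categories]

S = ["red", "green", "blue"]
encode(S, None)                       # unknown -> [0.0, 0.0, 0.0]
encode(S, {"red", "blue"})            # multi-valued -> [1.0, 0.0, 1.0]
encode(S, {"red": 0.7, "blue": 0.3})  # probabilistic -> [0.7, 0.0, 0.3]
```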