Continuous encoding of categorical variables trick
Definition
The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable that can take any value from a fixed finite set .
For each value in , define a binary feature that is 1 if equals that value and 0 otherwise. We get a total of features (i.e., as many features as the size of ) corresponding to the different elements of . This collection of features is called a feature family. Moreover, because the categorical variable can take only one value within , it is true that for any example, the values of the features from that family on that example include one 1 and the rest 0s.
If the majority of features in our data matrix are constructed from categorical variables using this trick, then the data matrix will turn out to be a sparse matrix.
Note that this continuous encoding allows us to consider the following variations:
- Examples where the value of the categorical variable is unknown: In this case, we just choose 0 for all the features.
- Examples where the categorical variable is multi-valued: In this case, we choose 1 for all the possible values.
- Examples where there is probabilistically quantified uncertainty as to the value of the categorical variable: In this case, we assign value equal to the probability of the given feature being the correct one.