Continuous encoding of categorical variables trick

Definition

The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable $v$ that can take any value from a fixed finite set $S$ .

For each value in $S$ , define a binary feature that is 1 if $v$ equals that value and 0 otherwise. We get a total of $| S |$ features (i.e., as many features as the size of $S$ ) corresponding to the different elements of $S$ . This collection of features is called a feature family. Moreover, because the categorical variable $v$ can take only one value within $S$ , it is true that for any example, the values of the features from that family on that example include one 1 and the rest 0s.

If the majority of features in our data matrix are constructed from categorical variables using this trick, then the data matrix will turn out to be a sparse matrix.

Note that this continuous encoding allows us to consider the following variations:

Examples where the value of the categorical variable is unknown: In this case, we just choose 0 for all the features.
Examples where the categorical variable is multi-valued: In this case, we choose 1 for all the possible values.
Examples where there is probabilistically quantified uncertainty as to the value of the categorical variable: In this case, we assign value equal to the probability of the given feature being the correct one.