Continuous encoding of categorical variables trick

Definition

The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable that can take any value from a fixed finite set S.

Feature set interpretation

For each value in S, define a binary feature whose value is 1 if the categorical variable equals that value and 0 otherwise. We get a total of |S| features (i.e., as many features as the size of S), one for each element of S. This collection of features is called a feature family.

Moreover, because the categorical variable takes exactly one value in S, for any example the features from that family take the value 1 exactly once and 0 everywhere else.
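
As a concrete illustration, here is a minimal Python sketch of a feature family for a single categorical variable; the set S, its values, and the function name are made up for illustration:

    # One binary feature per element of S; exactly one of them is 1 for any given value.
    S = ["red", "green", "blue"]

    def feature_family(value):
        """Return the feature family for `value`: 1 for the matching element of S, 0 for the rest."""
        return {element: int(element == value) for element in S}

    print(feature_family("green"))  # {'red': 0, 'green': 1, 'blue': 0}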

Vector interpretation

We can think of the continuous encoding as an embedding of the (discrete) finite set S into the |S|-dimensional real vector space R^|S|, where each element of S (under some fixed ordering) is sent to the corresponding standard basis vector.
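
A small Python sketch of this embedding, assuming an arbitrary but fixed ordering of S (the set itself and the use of NumPy are illustrative choices rather than part of the trick):

    import numpy as np

    S = ["red", "green", "blue"]                      # fixed ordering of the finite set S
    index = {value: i for i, value in enumerate(S)}   # element of S -> coordinate

    def embed(value):
        """Send an element of S to the corresponding standard basis vector of R^|S|."""
        e = np.zeros(len(S))
        e[index[value]] = 1.0
        return e

    print(embed("blue"))  # [0. 0. 1.]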

What this means for encoding examples

Direct encoding of examples

Consider a machine learning problem where each training example is described by a number of features that are categorical variables. For example, to each single-item purchase from an e-commerce website, we may associate categorical variables such as:

  1. The user ID of the buyer. Let's say there are three users, with IDs Alice, Bob, and Carol.
  2. A unique identifier for the inventory item purchased. Let's say there are two items that can be bought: headphones and USB flash drives, and that a single purchase can only involve buying one inventory item.
  3. The location that the item is being shipped to. Let's say shipping is to only five locations: Chicago, London, Mumbai, Shanghai, and Paris.
  4. The mode of payment. Let's say there are only two modes of payment: credit card and PayPal.

An example described in terms of the categorical variable encoding might be of the form:

user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal

Different examples would have different combinations of values of the categorical variables.

The continuous encoding trick converts an example described in the above form to a collection of binary features. Here's how.

  1. User ID can take three values. We therefore get a feature family with three binary features, one each for Alice, Bob, and Carol.
  2. Inventory item can take two values. We therefore get a feature family with two binary features, one each for headphones and USB flash drives.
  3. Location can take five values. We therefore get a feature family with five binary features, one each for Chicago, London, Mumbai, Shanghai, and Paris.
  4. Mode of payment can take two values. We therefore get a feature family with two binary features, one each for credit card and PayPal.

We thus get a total of 3 + 2 + 5 + 2 = 12 features in 4 families.

With this coding, the example:

user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal

becomes the example with the following values for the 12 features:

  1. User ID features: Alice = 1, Bob = 0, Carol = 0
  2. Inventory item: headphones = 0, USB flash drives = 1
  3. Location: Chicago = 0, London = 0, Mumbai = 1, Shanghai = 0, Paris = 0
  4. Mode of payment: credit card = 0, PayPal = 1

If we represent this as a matrix row, with the 12 features used as columns in the order given above, the row becomes:

(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1)
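
One possible Python sketch of this encoding step, using the family orderings above (the dictionary-based representation and the function names are illustrative assumptions, not a fixed API):

    # Sketch: encoding the purchase example into the 12-feature row described above.
    # Python dicts preserve insertion order, so the families are laid out in the order listed.
    families = {
        "user ID": ["Alice", "Bob", "Carol"],
        "inventory item": ["headphones", "USB flash drive"],
        "location": ["Chicago", "London", "Mumbai", "Shanghai", "Paris"],
        "mode of payment": ["credit card", "PayPal"],
    }

    def encode_example(example):
        """Concatenate the feature families: 1 where the example's value matches, 0 elsewhere."""
        row = []
        for family, values in families.items():
            row.extend(int(v == example[family]) for v in values)
        return row

    example = {
        "user ID": "Alice",
        "inventory item": "USB flash drive",
        "location": "Mumbai",
        "mode of payment": "PayPal",
    }
    print(encode_example(example))  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]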

Basically, the correspondence is:

  • Categorical feature ↔ feature family in the continuous encoding
  • Possible value of a categorical feature ↔ single feature in the continuous encoding
  • Value of a categorical feature for a particular example ↔ the unique feature in the corresponding feature family with value 1

Note that with this type of encoding:

  • The total number of features in the continuous encoding is the sum of the sizes of the sets of values for each categorical feature.
  • For any particular example, the number of features with value 1 equals the number of feature families, which in turn equals the number of categorical features. This is constant across examples, and less than the total number of features. In the toy example above, the numbers are 4 and 12 respectively, but on large and feature-rich datasets, the former number is rarely more than a thousand, while the latter number can be in the millions or even billions.

Encoding of example pair differences (used for ranking and recommendation algorithms)

For some ranking and recommendation purposes, our training data includes pairs of examples, where the goal is to learn that the first example in the pair is preferred over the second. One way to train on such data is to construct a training example by taking the difference of the vectors corresponding to the two examples. The rows of such examples contain the values 0, 1, and -1.
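
A small Python sketch, taking the 12-feature row of the worked example above as the preferred example and a hypothetical second purchase (Bob, USB flash drive, Paris, PayPal) encoded the same way as the less preferred one:

    # Pairwise-difference encoding for ranking: subtract the two encoded rows.
    preferred = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # Alice, USB flash drive, Mumbai, PayPal
    other     = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]   # Bob, USB flash drive, Paris, PayPal
    difference = [p - o for p, o in zip(preferred, other)]
    print(difference)  # [1, -1, 0, 0, 0, 0, 0, 1, 0, -1, 0, 0] -- entries are 0, 1, or -1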

Sparsity property

With the continuous encoding as described here, each example will have as many 1s as the number of feature families, i.e., the number of categorical variables, regardless of how large the total number of features is. The encoded examples are therefore extremely sparse, which makes it practical to store and process them using sparse representations.
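
A small Python sketch of one way to exploit this sparsity, storing only the indices of the nonzero features (the list-of-indices representation is just one illustrative choice; sparse matrix libraries do the same thing at scale):

    # Store an encoded example as the indices of its 1s rather than the full row.
    def to_sparse_indices(row):
        return [i for i, value in enumerate(row) if value != 0]

    row = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
    print(to_sparse_indices(row))  # [0, 4, 7, 11] -- one index per feature family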

Variations

Note that this continuous encoding allows us to consider the following variations (a small sketch of all three follows the list):

  • Examples where the value of the categorical variable is unknown: In this case, we simply set all the features in the family to 0.
  • Examples where the categorical variable is multi-valued: In this case, we set the feature to 1 for each of the values taken.
  • Examples where there is probabilistically quantified uncertainty as to the value of the categorical variable: In this case, we assign to each feature the probability that the corresponding value is the correct one.
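
A small Python sketch of all three variations for a single feature family (the domain, values, and probabilities below are illustrative assumptions):

    domain = ["Chicago", "London", "Mumbai", "Shanghai", "Paris"]

    def encode_unknown(domain):
        """Unknown value: every feature in the family is 0."""
        return [0] * len(domain)

    def encode_multivalued(values, domain):
        """Multi-valued: 1 for every value taken, 0 otherwise."""
        return [int(v in values) for v in domain]

    def encode_probabilistic(probabilities, domain):
        """Probabilistic: each feature gets the probability that its value is the correct one."""
        return [probabilities.get(v, 0.0) for v in domain]

    print(encode_unknown(domain))                                          # [0, 0, 0, 0, 0]
    print(encode_multivalued({"London", "Paris"}, domain))                 # [0, 1, 0, 0, 1]
    print(encode_probabilistic({"Mumbai": 0.7, "Shanghai": 0.3}, domain))  # [0.0, 0.0, 0.7, 0.3, 0.0]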