Continuous encoding of categorical variables trick

From Machinelearning
Revision as of 02:30, 6 October 2014 by Vipul

Definition

The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable v that can take any value from a fixed finite set S.

Feature set interpretation

For each value in S, define a binary feature whose value is 1 if v equals that value and 0 otherwise. We get a total of |S| features (i.e., as many features as the size of S) corresponding to the different elements of S. This collection of features is called a feature family.

Moreover, because the categorical variable v takes exactly one value in S, for any example exactly one feature in the family has the value 1 and the rest have the value 0. This representation is commonly known as a one-hot encoding.
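The construction above can be sketched in a few lines of Python; the value set and variable value below are illustrative, not from the text.

```python
# Minimal sketch: the binary feature family for one categorical variable.
def feature_family(S, v):
    """Return the binary features over value set S for categorical value v."""
    return [1 if s == v else 0 for s in S]

S = ["red", "green", "blue"]        # the fixed finite set S (illustrative)
print(feature_family(S, "green"))   # exactly one 1, the rest 0s
```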

Vector interpretation

We can think of the continuous encoding as an embedding of the (discrete) finite set S into the finite-dimensional vector space R^|S|, where each element of S gets sent to the standard basis vector for the corresponding coordinate.
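The vector interpretation amounts to building a lookup from each element of S to its standard basis vector; here is a small sketch (set contents are illustrative).

```python
# Sketch of the vector-space view: each element of S maps to a distinct
# standard basis vector of R^|S|.
def embed(S):
    """Map each element of S to its standard basis vector in R^|S|."""
    n = len(S)
    return {s: [1.0 if j == i else 0.0 for j in range(n)]
            for i, s in enumerate(S)}

basis = embed(["a", "b", "c"])
print(basis["b"])   # the standard basis vector for the second coordinate
```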

What this means for encoding examples

Consider a machine learning problem where each training example is described by several categorical variables. For example, to each single-item purchase from an e-commerce website, we may associate categorical variables such as:

  1. The user ID of the buyer. Let's say there are three users, with IDs Alice, Bob, and Carol.
  2. A unique identifier for the inventory item purchased. Let's say there are two items that can be bought: headphones and USB flash drives, and that a single purchase can only involve buying one inventory item.
  3. The location that the item is being shipped to. Let's say shipping is to only five locations: Chicago, London, Mumbai, Shanghai, and Paris.
  4. The mode of payment. Let's say there are only two modes of payment: credit card and PayPal.

An example described in terms of the categorical variable encoding might be of the form:

user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal

Different examples would have different combinations of values of the categorical variables.

The continuous encoding trick converts an example described in the above form to a collection of binary features. Here's how.

  1. User ID can take three values. We therefore get a feature family with three binary features, one each for Alice, Bob, and Carol.
  2. Inventory item can take two values. We therefore get a feature family with two binary features, one each for headphones and USB flash drives.
  3. Location can take five values. We therefore get a feature family with five binary features, one each for Chicago, London, Mumbai, Shanghai, and Paris.
  4. Mode of payment can take two values. We therefore get a feature family with two binary features, one each for credit card and PayPal.

We thus get a total of 3 + 2 + 5 + 2 = 12 features in 4 families.

With this coding, the example:

user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal

becomes the example with the following values for the 12 features:

  1. User ID features: Alice = 1, Bob = 0, Carol = 0
  2. Inventory item: headphones = 0, USB flash drives = 1
  3. Location: Chicago = 0, London = 0, Mumbai = 1, Shanghai = 0, Paris = 0
  4. Mode of payment: credit card = 0, PayPal = 1

If we represent this example as a matrix row, with the 12 features as columns in the order given above, the row becomes:

(1 0 0 0 1 0 0 1 0 0 0 1)
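The full encoding of the worked example can be sketched directly: each feature family contributes a block of binary features, and the blocks are concatenated into one row (the dictionary-based encoder below is an illustrative implementation, relying on Python's insertion-ordered dicts to fix the column order).

```python
# The four feature families from the example, in column order.
families = {
    "user_id": ["Alice", "Bob", "Carol"],
    "inventory_item": ["headphones", "USB flash drive"],
    "location": ["Chicago", "London", "Mumbai", "Shanghai", "Paris"],
    "mode_of_payment": ["credit card", "PayPal"],
}

def encode(example):
    """Concatenate the one-hot blocks for each family into a 12-entry row."""
    row = []
    for family, values in families.items():
        row += [1 if example[family] == val else 0 for val in values]
    return row

example = {
    "user_id": "Alice",
    "inventory_item": "USB flash drive",
    "location": "Mumbai",
    "mode_of_payment": "PayPal",
}
print(encode(example))   # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
```

Note that the encoded row has exactly one 1 per family: four 1s among the 12 entries.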

Note that this continuous encoding allows us to consider the following variations:

  • Examples where the value of the categorical variable is unknown: in this case, we simply set all the features in the family to 0.
  • Examples where the categorical variable is multi-valued: in this case, we set the feature to 1 for each value that the variable takes.
  • Examples where there is probabilistically quantified uncertainty about the value of the categorical variable: in this case, we set each feature to the probability that the corresponding value is the correct one.
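All three variations fit into one encoder that dispatches on how the value is supplied; this is a sketch under the assumption that an unknown value is passed as None, a multi-valued variable as a set, and an uncertain value as a dict of probabilities (these conventions are illustrative, not from the text).

```python
# One encoder covering the single-valued case and the three variations.
def encode_value(values, v):
    """Encode value v of a categorical variable over the given value set."""
    if v is None:                        # unknown: all features 0
        return [0.0] * len(values)
    if isinstance(v, dict):              # uncertain: probability per value
        return [float(v.get(val, 0.0)) for val in values]
    if isinstance(v, (set, frozenset)):  # multi-valued: 1 for each value taken
        return [1.0 if val in v else 0.0 for val in values]
    return [1.0 if val == v else 0.0 for val in values]  # ordinary one-hot

locations = ["Chicago", "London", "Mumbai", "Shanghai", "Paris"]
print(encode_value(locations, None))                           # unknown
print(encode_value(locations, {"Mumbai": 0.7, "Paris": 0.3}))  # uncertain
```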