Continuous encoding of categorical variables trick
Definition
The continuous encoding of categorical variables trick proceeds as follows. Consider a categorical variable that can take any value from a fixed finite set $S$.
Feature set interpretation
For each value in $S$, define a binary feature whose value is 1 if the categorical variable equals that value and 0 otherwise. We get a total of $|S|$ features (i.e., as many features as the size of $S$) corresponding to the different elements of $S$. This collection of features is called a feature family.
Moreover, because the categorical variable can take only one value within $S$, for any example, exactly one of the features in that family has the value 1 and the rest are 0.
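For concreteness, here is a minimal Python sketch of this construction (the function and the set used are purely illustrative, not part of any standard library):

```python
def one_hot(value, possible_values):
    """Build the feature family for a categorical variable: one binary
    feature per element of the finite set, with exactly one feature equal to 1."""
    return [1 if value == v else 0 for v in possible_values]

S = ["red", "green", "blue"]   # the fixed finite set of possible values
print(one_hot("green", S))     # [0, 1, 0]
```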
Vector interpretation
We can think of the continuous encoding as an embedding of the (discrete) finite set $S$ into the finite-dimensional vector space $\mathbb{R}^{|S|}$, where each element of $S$ gets sent to the corresponding standard basis vector.
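In code, this embedding is just a lookup into the rows of an identity matrix; a small NumPy sketch (assuming a fixed ordering of the elements of $S$):

```python
import numpy as np

S = ["red", "green", "blue"]

# The i-th element of S is sent to the i-th standard basis vector of R^|S|.
embedding = np.eye(len(S))
print(embedding[S.index("green")])  # [0. 1. 0.]
```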
What this means for encoding examples
Direct encoding of examples
Consider a machine learning problem where each training example has a bunch of features that are categorical variables. For example, to each single item purchase from an e-commerce website, we may associate categorical variables such as:
- The user ID of the buyer. Let's say there are three users, with IDs Alice, Bob, and Carol.
- A unique identifier for the inventory item purchased. Let's say there are two items that can be bought: headphones and USB flash drives, and that a single purchase can only involve buying one inventory item.
- The location that the item is being shipped to. Let's say shipping is to only five locations: Chicago, London, Mumbai, Shanghai, and Paris.
- The mode of payment. Let's say there are only two modes of payment: credit card and PayPal.
An example described in terms of the categorical variable encoding might be of the form:
user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal
Different examples would have different combinations of values of the categorical variables.
The continuous encoding trick converts an example described in the above form to a collection of binary features. Here's how.
- User ID can take three values. We therefore get a feature family with three binary features, one each for Alice, Bob, and Carol.
- Inventory item can take two values. We therefore get a feature family with two binary features, one each for headphones and USB flash drives.
- Location can take five values. We therefore get a feature family with five binary features, one each for Chicago, London, Mumbai, Shanghai, and Paris.
- Mode of payment can take two values. We therefore get a feature family with two binary features, one each for credit card and PayPal.
We thus get a total of 3 + 2 + 5 + 2 = 12 features in 4 families.
With this coding, the example:
user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal
becomes the example with the following values for the 12 features:
- User ID features: Alice = 1, Bob = 0, Carol = 0
- Inventory item: headphones = 0, USB flash drives = 1
- Location: Chicago = 0, London = 0, Mumbai = 1, Shanghai = 0, Paris = 0
- Mode of payment: credit card = 0, PayPal = 1
If we are representing this as a matrix row, and the 12 features are used as columns in the order provided, the row becomes:
(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1)
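A sketch of how this encoding of the toy example might look in Python (the dictionary layout and the function name are illustrative, not tied to any particular library):

```python
# One feature family per categorical variable, with columns in the order used above.
FAMILIES = {
    "user ID": ["Alice", "Bob", "Carol"],
    "inventory item": ["headphones", "USB flash drive"],
    "location": ["Chicago", "London", "Mumbai", "Shanghai", "Paris"],
    "mode of payment": ["credit card", "PayPal"],
}

def encode(example):
    """Concatenate the one-hot encodings of all feature families into one row."""
    row = []
    for family, values in FAMILIES.items():
        row.extend(1 if example[family] == v else 0 for v in values)
    return row

example = {
    "user ID": "Alice",
    "inventory item": "USB flash drive",
    "location": "Mumbai",
    "mode of payment": "PayPal",
}
print(encode(example))  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
```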
Basically, the correspondence is:
- Categorical feature ↔ Feature family in the continuous encoding
- Possible value for the categorical feature ↔ Single feature in the continuous encoding
- Value of the categorical feature for a particular example ↔ Unique feature in the corresponding feature family with value 1
Note that with this type of encoding:
- The total number of features in the continuous encoding is the sum of the sizes of the sets of values for each categorical feature.
- For any particular example, the number of features with value 1 equals the number of feature families, which in turn equals the number of categorical features. This is constant across examples, and less than the total number of features. In the toy example above, the numbers are 4 and 12 respectively, but on large and feature-rich datasets, the former number is rarely more than a thousand, while the latter number can be in the millions or even billions.
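Because the number of 1s per row is so much smaller than the total number of features, such data is typically stored in a sparse format. A sketch (reusing the illustrative FAMILIES and example from the code above) that records only the indices of the active features:

```python
def encode_sparse(example, families):
    """Return the column indices of the features equal to 1 (one per family)."""
    indices = []
    offset = 0
    for family, values in families.items():
        indices.append(offset + values.index(example[family]))
        offset += len(values)
    return indices

# Reuses FAMILIES and example from the earlier sketch.
print(encode_sparse(example, FAMILIES))  # [0, 4, 7, 11]
```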
Encoding of example pair differences (used for ranking and recommendation algorithms)
For some ranking purposes, our training data includes pairs of examples, where the goal is to learn that the first example in the pair is preferable to the second. One way of training on such pairs is to construct a training example by taking the difference of the vectors corresponding to the two examples. Such difference examples will include the values 0, 1, and -1 in their rows.
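A sketch of the difference construction, reusing the illustrative encode function from the earlier code (the sign convention, preferred example minus the other example, is an assumption):

```python
# Reuses encode() and FAMILIES from the earlier sketch.
preferred = encode({"user ID": "Alice", "inventory item": "headphones",
                    "location": "Mumbai", "mode of payment": "PayPal"})
other = encode({"user ID": "Alice", "inventory item": "USB flash drive",
                "location": "Mumbai", "mode of payment": "credit card"})

# +1 where only the preferred example has the feature, -1 where only the
# other example has it, 0 where the two agree.
difference = [p - o for p, o in zip(preferred, other)]
print(difference)  # [0, 0, 0, 1, -1, 0, 0, 0, 0, 0, -1, 1]
```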
Variations to handle possible problems
A number of problems can occur with the above approach. Some solutions are discussed below.
| Problem | Potential solutions |
|---|---|
| The value of a categorical variable is unknown for an example. | Solution (1): Simply mark all the binary features in the family as zero. The problem with this solution is that it disturbs the mathematical structure of the problem: the number of 1s in each example row is no longer constant across examples. Solution (2): Add an extra feature to each family, for "unknown feature value", and mark that feature as 1 for the examples where the value is unknown. This is a more robust solution than (1) in all respects. Solution (3): If we have some prior for what value the categorical variable could take, use that prior to assign probabilistic values to the corresponding features. For instance, if 80% of item sales are for headphones and 20% are for USB flash drives, we could assign a weight of 0.8 to headphones and 0.2 to USB flash drives when the item sold is unknown. This has some advantages, but also some disadvantages, relative to approach (2). The disadvantages are that we need information to assign the probabilities, the matrix need not remain sparse, and the entries are no longer just 0s and 1s. Moreover, we should consider it the job of the machine learning algorithm to learn the probable item values rather than trying to infer them when constructing the examples. |
| The categorical variable is multi-valued, i.e., for some examples, it can take multiple values. | Solution (1): Put a value of 1 on the feature corresponding to each of the values the categorical variable takes. Disadvantage: the number of 1s is no longer constant across examples. Also, sparsity is destroyed if some categorical variables take a very large number of values for some examples. Solution (2): Put a value of $1/k$ on each of the features corresponding to the values the categorical variable takes, where $k$ is the number of values. Disadvantage: the matrix is no longer a 0-1 matrix. Also, the sparsity problem is shared with Solution (1). Solution (3): Pick a random value from the values the categorical variable takes, and use just that. Disadvantage: we are throwing away potentially valuable information. |
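As an illustration, here is a hedged sketch combining Solution (2) for unknown values (an extra "unknown" feature per family) with Solution (2) for multi-valued variables (the $1/k$ weighting); the names and the exact weighting scheme are illustrative assumptions:

```python
def encode_family(values, possible_values):
    """Encode one feature family, handling unknown and multi-valued cases.

    `values` is the list of values the categorical variable takes on this
    example (empty if unknown). The last column is the "unknown" feature.
    """
    row = [0.0] * (len(possible_values) + 1)
    if not values:
        row[-1] = 1.0               # Solution (2) for unknown values
        return row
    weight = 1.0 / len(values)      # Solution (2) for multi-valued variables
    for v in values:
        row[possible_values.index(v)] = weight
    return row

items = ["headphones", "USB flash drive"]
print(encode_family([], items))                                 # [0.0, 0.0, 1.0]
print(encode_family(["headphones", "USB flash drive"], items))  # [0.5, 0.5, 0.0]
print(encode_family(["headphones"], items))                     # [1.0, 0.0, 0.0]
```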