Continuous encoding of categorical variables trick
==Definition==
The '''continuous encoding of categorical variables trick''' proceeds as follows. Consider a categorical variable <math>v</math> that can take any value from a fixed finite set <math>S</math>.
===Feature set interpretation===
For each value in <math>S</math>, define a binary feature whose value is 1 if <math>v</math> equals that value and 0 otherwise. We get a total of <math>|S|</math> features (i.e., as many features as the size of <math>S</math>) corresponding to the different elements of <math>S</math>. This collection of features is called a ''feature family''.
Moreover, because the categorical variable <math>v</math> takes exactly one value in <math>S</math>, on any example exactly one feature in the family has the value 1 and the rest are 0.
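As an illustration, the feature-family construction can be sketched in a few lines of Python. The names <code>one_hot</code> and <code>S</code> are our own, not from the definition above:

```python
# Minimal sketch of a feature family for one categorical variable.
# S is the fixed finite set of possible values, taken in a fixed order.

def one_hot(value, S):
    """Return the binary feature family for `value` over the ordered values S."""
    return [1 if value == s else 0 for s in S]

S = ["credit card", "PayPal"]
print(one_hot("PayPal", S))  # exactly one 1, the rest 0s -> [0, 1]
```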
===Vector interpretation===
We can think of the continuous encoding as an embedding of the (discrete) finite set <math>S</math> into the finite-dimensional vector space <math>\mathbb{R}^S</math>, where each element of <math>S</math> is sent to the corresponding standard basis vector.
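Concretely, after fixing an (arbitrary) ordering of <math>S</math>, the embedding sends the <math>i</math>-th element to the <math>i</math>-th standard basis vector. A sketch using NumPy, with our own illustrative names:

```python
import numpy as np

# Embed the finite set S into R^|S|: each element maps to a standard basis vector.
# The ordering of S (and hence of the coordinates) is an arbitrary fixed choice.
S = ["Chicago", "London", "Mumbai", "Shanghai", "Paris"]
index = {s: i for i, s in enumerate(S)}

def embed(value):
    e = np.zeros(len(S))
    e[index[value]] = 1.0
    return e

print(embed("Mumbai"))  # [0. 0. 1. 0. 0.], the third standard basis vector
```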
==What this means for encoding examples==
Consider a machine learning problem where each training example has a bunch of features that are categorical variables. For example, to each single item purchase from an e-commerce website, we may associate categorical variables such as:
# The user ID of the buyer. Let's say there are three users, with IDs Alice, Bob, and Carol.
# A unique identifier for the inventory item purchased. Let's say there are two items that can be bought: headphones and USB flash drives, and that a single purchase can only involve buying one inventory item.
# The location that the item is being shipped to. Let's say shipping is to only five locations: Chicago, London, Mumbai, Shanghai, and Paris.
# The mode of payment. Let's say there are only two modes of payment: credit card and PayPal.
An example described in terms of the categorical variable encoding might be of the form:
user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal
Different examples would have different combinations of values of the categorical variables.
The continuous encoding trick converts an example described in the above form to a collection of binary features. Here's how.
# User ID can take three values. We therefore get a feature family with three binary features, one each for Alice, Bob, and Carol.
# Inventory item can take two values. We therefore get a feature family with two binary features, one each for headphones and USB flash drives.
# Location can take five values. We therefore get a feature family with five binary features, one each for Chicago, London, Mumbai, Shanghai, and Paris.
# Mode of payment can take two values. We therefore get a feature family with two binary features, one each for credit card and PayPal.
We thus get a total of 3 + 2 + 5 + 2 = 12 features in 4 families.
With this coding, the example:
user ID = Alice, inventory item = USB flash drive, location = Mumbai, mode of payment = PayPal
becomes the example with the following values for the 12 features:
# User ID features: Alice = 1, Bob = 0, Carol = 0
# Inventory item: headphones = 0, USB flash drives = 1
# Location: Chicago = 0, London = 0, Mumbai = 1, Shanghai = 0, Paris = 0
# Mode of payment: credit card = 0, PayPal = 1
If we are representing this as a matrix row, and the 12 features are used as columns in the order provided, the row becomes:
<math>\begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{pmatrix}</math>
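The walkthrough above can be carried out mechanically: encode each family separately and concatenate the results. A sketch, where the data structures (the <code>families</code> list and the example dictionary) are our own choice and not prescribed by the trick:

```python
# Concatenate the four feature families into a single 12-dimensional row.
# Family and value orderings follow the walkthrough above.
families = [
    ("user ID", ["Alice", "Bob", "Carol"]),
    ("inventory item", ["headphones", "USB flash drive"]),
    ("location", ["Chicago", "London", "Mumbai", "Shanghai", "Paris"]),
    ("mode of payment", ["credit card", "PayPal"]),
]

def encode_example(example):
    row = []
    for family, values in families:
        row.extend(1 if example[family] == v else 0 for v in values)
    return row

example = {
    "user ID": "Alice",
    "inventory item": "USB flash drive",
    "location": "Mumbai",
    "mode of payment": "PayPal",
}
print(encode_example(example))
# -> [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
```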
Revision as of 02:27, 6 October 2014
Note that this continuous encoding allows us to consider the following variations:
# Examples where the value of the categorical variable is unknown: in this case, we set all the features in the family to 0.
# Examples where the categorical variable is multi-valued: in this case, we set the feature for each applicable value to 1.
# Examples where there is probabilistically quantified uncertainty as to the value of the categorical variable: in this case, we assign each feature a value equal to the probability that its corresponding value is the correct one.
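These three variations can be sketched for a single feature family; the example data below (the location family, the two shipping destinations, and the probabilities) is our own illustration:

```python
# The three variations, illustrated on the five-value location family.
S = ["Chicago", "London", "Mumbai", "Shanghai", "Paris"]

# 1. Unknown value: every feature in the family is 0.
unknown = [0] * len(S)

# 2. Multi-valued (say the order ships to both London and Paris): 1 per applicable value.
multi = [1 if s in {"London", "Paris"} else 0 for s in S]  # [0, 1, 0, 0, 1]

# 3. Probabilistic uncertainty (say P(Mumbai) = 0.7, P(Shanghai) = 0.3):
#    each feature gets the probability of its value.
probs = {"Mumbai": 0.7, "Shanghai": 0.3}
uncertain = [probs.get(s, 0.0) for s in S]  # [0.0, 0.0, 0.7, 0.3, 0.0]
```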