Crossing of categorical variables

Crossing, or taking crosses, is the process of constructing new categorical variables from existing ones by pairing them. If one categorical variable takes values in a finite set $A$ and another categorical variable takes values in a finite set $B$, we can define a new categorical variable that takes values in $A \times B$, whose value is the ordered pair (value of first categorical variable, value of second categorical variable).

For instance, suppose we want to know what kinds of items people buy from an e-commerce website. We have two categorical variables: city and device type used to access the website. We can construct a new categorical variable: city X device type, that stores information about both the city and the device type.
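As a minimal sketch of this construction, assume each example is stored as a Python dict and the cross is represented as a tuple of the two elementary values. The city and device names, and the names `cross`, `cities`, and `device_types`, are made up for illustration.

```python
from itertools import product

# Hypothetical value sets for the two elementary categorical variables.
cities = ["San Francisco", "Chicago", "Boston"]
device_types = ["iPhone 6", "Android phone"]

def cross(first_value, second_value):
    """Return the value of the crossed variable as an ordered pair."""
    return (first_value, second_value)

# All theoretically possible values of city X device type: the set A x B.
cross_values = list(product(cities, device_types))

# One example (say, one visit to the website) with its derived cross.
example = {"city": "Chicago", "device_type": "iPhone 6"}
example["city_x_device_type"] = cross(example["city"], example["device_type"])

print(len(cross_values))              # 6 = 3 cities * 2 device types
print(example["city_x_device_type"])  # ('Chicago', 'iPhone 6')
```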

Crosses are derived features, and are information-theoretically redundant with the elementary features. However, if we are using a generalized linear model such as linear or logistic regression, or an otherwise constrained functional form, introducing crosses can make the model more powerful by allowing for the expression of more complicated relationships.
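To illustrate the gain in expressive power, here is a small sketch with two binary categorical variables; the target values and weights are made up. A linear predictor built only from the elementary indicator features has the form `u[a] + v[b]`, and every function of that form satisfies f(0,0) + f(1,1) - f(0,1) - f(1,0) = 0. A target that violates this identity cannot be matched without crosses, whereas one weight per crossed value matches it exactly.

```python
from itertools import product

values = (0, 1)
pairs = list(product(values, values))

# Hypothetical target: high when the two variables agree, low otherwise.
target = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

def interaction_contrast(f):
    """Zero for every function of the form u[a] + v[b]."""
    return f[(0, 0)] + f[(1, 1)] - f[(0, 1)] - f[(1, 0)]

# Elementary features only: one weight per value of each variable.
u = {0: 0.5, 1: -1.25}   # arbitrary weights for the first variable
v = {0: 2.0, 1: 0.75}    # arbitrary weights for the second variable
additive = {(a, b): u[a] + v[b] for a, b in pairs}

# Crossed features: one weight per pair of values.
w = dict(target)         # setting each weight to the target fits it exactly
crossed = {(a, b): w[(a, b)] for a, b in pairs}

print(abs(interaction_contrast(additive)) < 1e-12)  # True for any u, v
print(interaction_contrast(target))                 # 2.0, so no u, v can match
print(crossed == target)                            # True
```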

Relation between crossing and continuous encoding

For regression and classification problems, categorical variables are typically converted to numerical features using the continuous encoding of categorical variables. Here, we discuss the relationship between crossing and continuous encoding.

Suppose we have a categorical variable for city and a categorical variable for device type. Let's say there are 100 cities and 12 device types. The cross of these, namely city X device type, has 100 × 12 = 1200 possible values. However, not all of these need occur in actual examples, even if every city and every device type occurs. Thus, in practice, we may need far fewer than 1200 features for the cross. For simplicity, though, let's assume we're using all 1200 theoretically possible crosses.
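The gap between theoretically possible and actually observed crosses can be sketched on a toy data set. The log of (city, device type) pairs below is invented for illustration, and it is far smaller than the 100-city, 12-device setting in the text.

```python
# Hypothetical log of (city, device type) pairs, one per example.
examples = [
    ("San Francisco", "iPhone 6"),
    ("San Francisco", "Android phone"),
    ("Chicago", "iPhone 6"),
    ("Boston", "Android phone"),
    ("Chicago", "iPhone 6"),   # a repeated pair adds no new cross value
]

cities = {city for city, _ in examples}
device_types = {device for _, device in examples}
observed_crosses = set(examples)

# Every city and every device type occurs, yet not every pair does.
theoretical = len(cities) * len(device_types)
print(theoretical)            # 6
print(len(observed_crosses))  # 4
```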

From the categorical variable perspective, we have three features:

The pure (elementary) feature called city, with 100 possible values.
The pure (elementary) feature called device type, with 12 possible values.
The derived feature that is the cross city X device type, with 1200 possible values.

In the continuous encoding, the features become feature families, and the possible values become features. Thus, we have three feature families:

The elementary feature family corresponding to city, with 100 features.
The elementary feature family corresponding to device type, with 12 features.
The derived feature family corresponding to city X device type, with 1200 features.

The total number of features is 1312.
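The count can be checked with a short sketch, assuming the continuous encoding is the usual indicator (one-hot) encoding in which each possible value gets its own 0/1 feature. The placeholder names `city_0`, `device_0`, and so on, and the helper `encode`, are hypothetical.

```python
from itertools import product

# Placeholder names standing in for 100 cities and 12 device types.
cities = [f"city_{i}" for i in range(100)]
device_types = [f"device_{j}" for j in range(12)]

# Give each possible value in each feature family its own column index.
feature_names = (
    [("city", c) for c in cities]
    + [("device_type", d) for d in device_types]
    + [("city_x_device_type", pair) for pair in product(cities, device_types)]
)
index = {name: i for i, name in enumerate(feature_names)}

def encode(city, device_type):
    """Return the 0/1 feature vector for one example."""
    x = [0] * len(index)
    x[index[("city", city)]] = 1
    x[index[("device_type", device_type)]] = 1
    x[index[("city_x_device_type", (city, device_type))]] = 1
    return x

x = encode("city_7", "device_3")
print(len(x))  # 1312 = 100 + 12 + 1200
print(sum(x))  # 3: exactly one active feature per family
```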

Note, however, that although a crossed feature is not a linear combination of the elementary features, it is not algebraically independent of them: for any example, the value of a crossed feature is exactly the product of the values of the two factor features. For instance, the value of the feature obtained by crossing the city San Francisco with the device type iPhone 6 is equal to the product of the feature values for San Francisco and iPhone 6.
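In symbols, writing $x_c$ and $x_d$ for the indicator features of city $c$ and device type $d$, the crossed feature satisfies $x_{(c,d)} = x_c \cdot x_d$. The sketch below checks this identity exhaustively on a small made-up set of cities and device types, assuming 0/1 indicator features; the helper `indicator` is illustrative.

```python
from itertools import product

cities = ["San Francisco", "Chicago"]
device_types = ["iPhone 6", "Android phone"]

def indicator(actual, wanted):
    """0/1 feature: is the example's value equal to the wanted value?"""
    return 1 if actual == wanted else 0

checks = []
for city, device in product(cities, device_types):   # every possible example
    for c, d in product(cities, device_types):       # every crossed feature
        crossed = indicator((city, device), (c, d))
        checks.append(crossed == indicator(city, c) * indicator(device, d))

print(all(checks))  # True: each crossed feature equals the product
print(len(checks))  # 16 = 4 examples * 4 crossed features
```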