Data matrix

Definition

The data matrix for a machine learning problem, given a set of examples and a set of features, is defined as follows: the rows of the matrix correspond to the examples, and the columns correspond to the features. The entry in a given row and column is the value of that column's feature for that row's example. In particular, if there are m examples and n features, then the data matrix is an m×n matrix. In the context of linear regression and logistic regression problems, the data matrix is also termed the design matrix of the regression.

For supervised learning problems, a labels vector is provided in addition to the data matrix; its length is m, the number of examples, with one label per example (per row of the data matrix).
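As a concrete illustration (the feature values and labels below are hypothetical, and NumPy is assumed), a data matrix with m = 4 examples and n = 3 features, together with a labels vector of length m, might look like:

```python
import numpy as np

# Hypothetical data matrix: m = 4 examples (rows), n = 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

# Labels vector for a supervised learning problem: one label per example,
# so its length equals m, the number of rows of X.
y = np.array([0, 0, 1, 1])

m, n = X.shape
assert y.shape == (m,)
print(m, n)  # 4 3
```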

Encoding unknown values

The data matrix may use specific conventions to encode feature values that are unknown.
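For example, one common convention (an assumption here, not something this page prescribes) is to store the data matrix as floating-point numbers and encode unknown values as NaN:

```python
import numpy as np

# NaN as the (assumed) convention for unknown feature values.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, np.nan, 1.4],   # second feature unknown for this example
    [6.2, 2.9, np.nan],   # third feature unknown for this example
])

# Locate the unknown entries before handing X to a purely numeric algorithm.
unknown_mask = np.isnan(X)
print(int(unknown_mask.sum()), "unknown entries")
```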

Significance in linear algebra

Whether the data matrix is helpful from a linear algebra perspective, i.e., whether linear algebra operations (addition, multiplication) on the data matrix are useful algorithmically, depends on the model type being used. For most model types, including linear regression, logistic regression, support vector machines, and all generalized linear models, the data matrix can be used directly in linear algebra computations.
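A minimal sketch of this, assuming NumPy and a synthetic linear regression problem: the data matrix X enters the computation directly through matrix products, here via the normal equations of ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                          # hypothetical sizes (a tall matrix)
X = rng.normal(size=(m, n))            # data matrix
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

# Ordinary least squares via the normal equations (X^T X) theta = X^T y:
# the data matrix is used directly in linear algebra operations.
theta = np.linalg.solve(X.T @ X, X.T @ y)
predictions = X @ theta                # matrix-vector product gives predictions
print(np.round(theta, 2))
```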

Note that if the data matrix allows non-numeric values (such as specific conventions for encoding unknown values), any algorithm that would treat it simply as a matrix of numbers must be modified to account for the possibility of non-numeric values.
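One possible modification (a sketch under the assumption that unknown values are encoded as NaN) is to impute them, for example with per-feature means, before applying a purely numeric algorithm:

```python
import numpy as np

X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, np.nan, 1.4],
    [6.2, 2.9, np.nan],
])

# Per-feature means, computed while ignoring the unknown (NaN) entries.
col_means = np.nanmean(X, axis=0)

# Replace each unknown entry with the mean of its feature (column).
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
print(X)
```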


Terminology

Tall, wide, and square

A data matrix is termed:

  • approximately square if the number of examples and number of features are comparable,
  • tall if the number of examples is substantially greater than the number of features, and
  • wide if the number of features is substantially greater (roughly, twice or more) than the number of examples.

In general, we expect the following:

  • For tall data matrices, it is likely that the matrix has close to full column rank, i.e., we have more data than features. Holding the set of features constant, the taller the matrix gets, the better the model we are likely to find, as measured by generalization error (the error on the training set itself will typically get worse). Tall data matrices are the typical situation in machine learning, partly because we deliberately try to keep the number of features small enough that the data matrix is tall.
  • For wide data matrices, it is likely that the matrix has close to full row rank, i.e., we have too many features and too little data to reliably infer what the features are doing. In this case, overfitting becomes a far bigger concern, and fairly aggressive regularization or other strategies may be needed to train a model with low generalization error. (See the rank-check sketch after this list.)
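A minimal sketch of the rank behavior described above, using random matrices as stand-ins for real data (the sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

X_tall = rng.normal(size=(500, 10))   # many more examples than features
X_wide = rng.normal(size=(10, 500))   # many more features than examples

# A generic tall matrix typically has full column rank (rank = n = 10);
# a generic wide matrix typically has full row rank (rank = m = 10).
print(np.linalg.matrix_rank(X_tall))  # expected: 10
print(np.linalg.matrix_rank(X_wide))  # expected: 10
```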