Data matrix

Definition

The data matrix for a machine learning problem, for a set of examples and a set of features, is defined as follows. The rows of the matrix correspond to the examples, and the columns of the matrix correspond to the features. The entry in a given row and a given column is the value of that column's feature for that row's example. In particular, if there are n examples and m features, then the data matrix is an n × m matrix. In the context of linear regression and logistic regression problems, the data matrix is also termed the design matrix of the regression.

For supervised learning problems, in addition to the data matrix, a labels vector is provided that has the same dimension as n, the number of examples.
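
As a concrete illustration (a minimal sketch using NumPy, with made-up feature values), a problem with n = 4 examples and m = 3 features has a 4 × 3 data matrix and a labels vector of length 4:

  import numpy as np

  # Data matrix X: one row per example, one column per feature.
  # Here n = 4 examples and m = 3 features, so X is a 4 x 3 matrix.
  X = np.array([
      [5.1, 3.5, 1.4],
      [4.9, 3.0, 1.4],
      [6.2, 3.4, 5.4],
      [5.9, 3.0, 5.1],
  ])

  # Labels vector y for a supervised learning problem: one entry per example.
  y = np.array([0, 0, 1, 1])

  assert X.shape == (4, 3)         # (number of examples, number of features)
  assert y.shape[0] == X.shape[0]  # same dimension as n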

Encoding unknown values

The data matrix may use specific conventions to encode feature values that are unknown.
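
For example, one common convention (sketched below with NumPy; NaN is one marker among several in use) is to store unknown numeric values as NaN:

  import numpy as np

  # Unknown feature values encoded as NaN (one common convention).
  X = np.array([
      [5.1, 3.5, 1.4],
      [4.9, np.nan, 1.4],   # second feature unknown for this example
      [6.2, 3.4, np.nan],   # third feature unknown for this example
  ])

  # Locate the unknown entries.
  unknown_mask = np.isnan(X)
  print(unknown_mask.sum(), "unknown values")  # prints: 2 unknown values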

Significance in linear algebra

Whether the data matrix is helpful from a linear algebra perspective, i.e., whether linear algebra operations (addition, multiplication) on the data matrix are algorithmically useful, depends on the type of model being used. For most model types, including linear regression, logistic regression, support vector machines, and all generalized linear models, the data matrix can be used directly in linear algebra computations.

Note that if the data matrix allows non-numeric values (such as specific conventions to encode unknown values), any algorithm that would treat it simply as a matrix of numbers needs to be modified to account for the possibility of non-numeric values.
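
As a minimal sketch of both points (NumPy, made-up data): the least-squares coefficients of a linear regression can be computed directly from the data matrix with standard matrix operations, but only after checking that the matrix is purely numeric:

  import numpy as np

  X = np.array([[1.0, 2.0],
                [2.0, 0.5],
                [3.0, 1.5],
                [4.0, 3.0]])
  y = np.array([3.1, 2.4, 4.6, 7.1])

  # The algorithm below treats X as a matrix of numbers, so it must not
  # contain encoded unknowns such as NaN.
  if np.isnan(X).any():
      raise ValueError("data matrix has unknown values; impute or drop them first")

  # Ordinary least squares: find w minimizing ||X w - y||^2.
  w, *_ = np.linalg.lstsq(X, y, rcond=None)
  predictions = X @ w  # matrix multiplication on the data matrix itself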

Relation with conceptual paradigms in computation

  • In an object-oriented implementation, each example would be an instance of a class, and the features would be the fields of that class.
  • In a SQL-style relational database, the rows of the data matrix correspond to rows of a table, and the columns correspond to the columns (fields) of the table. In other words, the data matrix quite literally looks the same as the table used to store it in a database (both correspondences are sketched after this list).
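
A minimal sketch of both correspondences (the class name, field names, and table name are invented for illustration):

  from dataclasses import dataclass

  # Object-oriented view: one object per example, one field per feature.
  @dataclass
  class Example:
      sepal_length: float
      sepal_width: float
      petal_length: float

  dataset = [
      Example(5.1, 3.5, 1.4),  # one row of the data matrix
      Example(4.9, 3.0, 1.4),
  ]

  # Relational view: the same data matrix as a table.
  #   CREATE TABLE examples (
  #       sepal_length REAL,
  #       sepal_width  REAL,
  #       petal_length REAL
  #   );
  # Each row of the table is an example; each column is a feature.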

Terminology

Tall, wide, and square

A data matrix is termed:

  • approximately square if the number of examples and number of features are comparable,
  • tall if the number of examples is substantially greater than the number of features, and
  • wide if the number of features is substantially greater (roughly, twice as many or more) than the number of examples.

In general, we expect the following:

  • For tall data matrices, it is likely that the matrix has close to full column rank, i.e., we have more data than features. Holding the set of features constant, the taller the matrix gets, the better the model we are likely to find in terms of generalization error (the error on the training set will typically get worse). Tall data matrices are the typical situation in machine learning, partly because we deliberately try to keep the number of features small enough that the data matrix is tall.
  • For wide data matrices, it is likely that the matrix has close to full row rank, i.e., we have too many features and too little data to reliably infer what the features are doing. In this case, overfitting becomes a far bigger concern, and fairly aggressive regularization or other strategies may be needed to train a model with low generalization error (a sketch follows this list).
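
These expectations can be sketched as follows (random Gaussian data, and the ridge penalty strength lam is an arbitrary illustrative choice):

  import numpy as np

  rng = np.random.default_rng(0)

  tall = rng.normal(size=(100, 5))    # far more examples than features
  wide = rng.normal(size=(5, 100))    # far more features than examples

  print(np.linalg.matrix_rank(tall))  # 5: full column rank (almost surely)
  print(np.linalg.matrix_rank(wide))  # 5: full row rank (almost surely)

  # With a wide matrix, plain least squares can fit the training data exactly
  # (overfitting); ridge regression adds a penalty lam * ||w||^2 to combat this.
  X, y = wide, rng.normal(size=5)
  lam = 1.0  # regularization strength (illustrative choice)
  w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)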

Storage and computation

Distributed storage

In practice, data matrices can be too large to store and manipulate on a single machine, so storage is generally distributed.

Distributed storage is typically organized by rows (examples): each node stores a different subset of the set of all examples, and stores all features for that subset. There are several reasons to distribute by rows rather than by columns (a sharding sketch follows the list):

  • The way data is collected, different examples are collected at different times and places, whereas all the features of a single example are generally collected together. It therefore makes more sense to store all features for an example together.
  • Conceptually, features generally have their own names and identities, whereas examples are generally anonymous. It is therefore easier to shard over examples than over features.
  • Distributed storage by rows agrees better with both object-oriented programming and relational database storage implementations.
  • Most of the algorithms used for manipulation can be parallelized better by rows.
  • In cases where categorical variables are encoded using sparse binary features (the continuous encoding of categorical variables trick), we get a large sparse data matrix. The rows are all guaranteed to be sparse, with roughly similar density, whereas the columns can vary quite widely in density. Distributing by rows therefore gives a better guarantee of an even distribution of data across the different storage nodes.
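
A minimal sketch of row-wise sharding (the node count is arbitrary, and the in-memory list of shards is a stand-in for a real distributed store):

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(10_000, 20))  # 10,000 examples, 20 features

  # Shard by rows: each node receives a block of examples,
  # with all features for those examples.
  num_nodes = 4
  shards = np.array_split(X, num_nodes, axis=0)

  for i, shard in enumerate(shards):
      print(f"node {i}: {shard.shape[0]} examples x {shard.shape[1]} features")

  # Many algorithms parallelize naturally over row shards; for example,
  # the Gram matrix X^T X decomposes into a sum of per-shard contributions.
  gram = sum(shard.T @ shard for shard in shards)
  assert np.allclose(gram, X.T @ X)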