The cost function associated with a given machine learning problem is a function that takes as input the predicted function value and actual observed output and associates to them a number measuring how far the predicted value is from the observed value.
This section gives the definitions of cost functions for a single piece of data, for both regression problems and classification problems.
For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function of two variables (the predicted value and actual value) satisfying the following conditions:
- for all
- For , we have and
The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.
Two typical examples of cost functions are (the quadratic cost function, used for least squares linear regression) and (used for least absolute deviations regression). With the former cost function, we see that for , we have , so the triangle inequality is not satisfied.
Binary classification problems
For binary classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function of two variables and (the predicted probability and actual value) satisfying the following conditions:
- For , we have
- For , we have
Although not a strict requirement, cost functions should be selected to be proper scoring rules, so that they penalize accurate probabilities less than inaccurate probabilities.
Typical cost functions used for binary classification problems include the logarithmic cost function and the quadratic cost function, both of which are proper scoring rules. Of these, the logarithmic scoring rules is the more widely used, because it is the only proper scoring rule when considering more than two classes.
Combining the cost function values across multiple data points
To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices:
- Arithmetic mean: This is the most common, and the default specification. This is equivalent to using the sum of the cost functions, but using the mean instead of the sum is preferable because that allows us to directly compare cost function values for data sets of different sizes.
- Mean using powers, for some : We take the mean of the powers of all the cost functions, then take the root.
- Maximum value
Note that there is some flexibility in terms of how we divide the load between the choice of cost function and the choice of averaging function: for instance, there is some equivalence between using the absolute value cost function and the root mean square averaging process versus using the squared error cost function and the arithmetic mean averaging process. The cost functions obtained in both cases are equivalent under a monotone transformation. If, however, we are considering adding regularization terms, then the distinction between the cost functions matters.