Cost function
The '''cost function''' associated with a given machine learning problem is a function that takes as input the predicted value and the actual observed value, and associates to them a number measuring how far the predicted value is from the observed value.
==Definition==
This section gives the definitions of cost functions for a single piece of data, for both regression problems and classification problems.
===Regression problems===
For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function <math>C\colon \mathbb R \times \mathbb R \to \mathbb R</math> of two variables <math>u,v \in \mathbb R</math> (the predicted value and actual value) satisfying the following conditions:
* <math>C(u,u) = 0</math> for all <math>u \in \mathbb R</math>
* For <math>u \le v \le w</math>, we have <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math>
The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.

Two typical examples of cost functions are <math>C(u,v) = (u-v)^2</math> (the [[quadratic cost function]], used for least squares linear regression) and <math>C(u,v) = |u-v|</math> (used for least absolute deviations regression). With the former cost function, we see that for <math>-1,0,1 \in \mathbb R</math>, we have <math>(-1-0)^2 + (0-1)^2 = 2 \not\geq 4 = (-1-1)^2</math>, so the triangle inequality is not satisfied.
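To make this concrete, here is a minimal Python sketch (the function names <code>quadratic_cost</code> and <code>absolute_cost</code> are ours, chosen for illustration) that evaluates both cost functions, checks the first defining condition, and reproduces the triangle-inequality failure above:

<syntaxhighlight lang="python">
def quadratic_cost(u, v):
    """Quadratic cost C(u, v) = (u - v)^2: penalizes bigger errors superlinearly."""
    return (u - v) ** 2

def absolute_cost(u, v):
    """Absolute-value cost C(u, v) = |u - v|: penalizes errors linearly."""
    return abs(u - v)

# First condition: C(u, u) = 0 for all u.
assert quadratic_cost(3.0, 3.0) == 0 and absolute_cost(3.0, 3.0) == 0

# Triangle inequality fails for the quadratic cost:
# C(-1, 0) + C(0, 1) = 2, which is less than C(-1, 1) = 4.
assert quadratic_cost(-1, 0) + quadratic_cost(0, 1) < quadratic_cost(-1, 1)

# The absolute-value cost, by contrast, does satisfy it for these points.
assert absolute_cost(-1, 0) + absolute_cost(0, 1) >= absolute_cost(-1, 1)
</syntaxhighlight>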
===Binary classification problems===
For binary classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function <math>C\colon [0,1]\times \{0,1\} \to \mathbb R</math> of two variables <math>p\in [0,1]</math> and <math>v \in \{0,1\}</math> (the predicted probability and actual value) satisfying the following conditions:
* <math>C(1,1) = 0</math>
* <math>C(0,0) = 0</math>
* For <math>p \le q</math>, we have <math>C(q,1) \le C(p,1)</math>
* For <math>p \le q</math>, we have <math>C(p,0) \le C(q,0)</math>
Although not a strict requirement, cost functions should be selected to be [[calculus:proper scoring rule|proper scoring rule]]s, so that they penalize accurate probabilities less than inaccurate probabilities.

Typical cost functions used for binary classification problems include the [[calculus:logarithmic scoring rule|logarithmic cost function]] and the [[quadratic cost function]], both of which are proper scoring rules. Of these, the logarithmic cost function is the more widely used, because it is the only [[calculus:logarithmic scoring rule is the only proper scoring rule up to affine transformations in case of more than two classes|proper scoring rule (up to affine transformations) when considering more than two classes]].
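As an illustration, the sketch below (a minimal example; the function names are ours, and the logarithmic cost is taken with natural logarithms) implements both cost functions and checks the four conditions listed above:

<syntaxhighlight lang="python">
import math

def logarithmic_cost(p, v):
    """Logarithmic cost: minus the log of the probability assigned to the observed class."""
    return -math.log(p) if v == 1 else -math.log(1 - p)

def quadratic_cost(p, v):
    """Quadratic cost for a predicted probability p and observed value v."""
    return (p - v) ** 2

# C(1, 1) = 0 and C(0, 0) = 0.
assert logarithmic_cost(1, 1) == 0 and logarithmic_cost(0, 0) == 0
assert quadratic_cost(1, 1) == 0 and quadratic_cost(0, 0) == 0

# For p <= q: C(q, 1) <= C(p, 1) (more probability on the true class costs less)
# and C(p, 0) <= C(q, 0).
p, q = 0.3, 0.7
assert logarithmic_cost(q, 1) <= logarithmic_cost(p, 1)
assert logarithmic_cost(p, 0) <= logarithmic_cost(q, 0)
assert quadratic_cost(q, 1) <= quadratic_cost(p, 1)
assert quadratic_cost(p, 0) <= quadratic_cost(q, 0)
</syntaxhighlight>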
==Combining the cost function values across multiple data points==
To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices (see the sketch after this list):
* Arithmetic mean: This is the most common, and the default specification. This is equivalent to using the sum of the cost functions, but using the mean instead of the sum is preferable because it allows us to directly compare cost function values for data sets of different sizes.
* Mean using <math>r^{\text{th}}</math> powers, for some <math>r > 1</math>: We take the mean of the <math>r^{\text{th}}</math> powers of all the cost functions, then take the <math>r^{\text{th}}</math> root.
* Maximum value

Note that there is some flexibility in how we divide the load between the choice of cost function and the choice of averaging function: for instance, using the absolute value cost function with root mean square averaging is essentially equivalent to using the squared error cost function with arithmetic mean averaging, in that the aggregate cost functions obtained in the two cases are related by a monotone transformation. If, however, we are considering adding regularization terms, then the distinction between the cost functions matters.
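To illustrate these averaging choices and the equivalence just noted, here is a short sketch (the helper name <code>power_mean</code> is ours): the root mean square of the absolute-value costs agrees with the arithmetic mean of the squared-error costs up to the monotone map <math>x \mapsto \sqrt{x}</math>, so the two aggregates rank candidate predictors identically.

<syntaxhighlight lang="python">
import math

def power_mean(costs, r):
    """Mean using r-th powers: take the mean of the r-th powers, then the r-th root."""
    return (sum(c ** r for c in costs) / len(costs)) ** (1 / r)

predicted = [1.0, 2.0, 4.0]
actual = [1.5, 2.0, 3.0]

abs_costs = [abs(u - v) for u, v in zip(predicted, actual)]
sq_costs = [(u - v) ** 2 for u, v in zip(predicted, actual)]

mean_sq = sum(sq_costs) / len(sq_costs)   # arithmetic mean of squared-error costs
rms_abs = power_mean(abs_costs, 2)        # root mean square of absolute-value costs
worst = max(sq_costs)                     # maximum-value aggregation

# The two aggregates are related by the monotone transformation x -> sqrt(x).
assert math.isclose(rms_abs, math.sqrt(mean_sq))
</syntaxhighlight>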