Cost function

The '''cost function''' associated with a given machine learning problem is a function that takes as input the predicted function value and actual observed output and associates to them a number measuring how far the predicted value is from the observed value.
==Definition==


This section gives the definitions of cost functions for a single piece of data, for both regression problems and classification problems.
 
===Regression problems===
 
For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function <math>C\colon \mathbb R \times \mathbb R \to \mathbb R</math> of two variables <math>u,v \in \mathbb R</math> (the predicted value and actual value) satisfying the following conditions:
 
* <math>C(u,u) = 0</math> for all <math>u \in \mathbb R</math>
* For <math>u \le v \le w</math>, we have <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math>
 
The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.


Two typical examples of cost functions are <math>C(u,v) = (u-v)^2</math> (the [[quadratic cost function]], used for least squares linear regression) and <math>C(u,v) = |u-v|</math> (used for least absolute deviations regression). With the former cost function, we see that for <math>-1,0,1 \in \mathbb R</math>, we have <math>(-1-0)^2 + (0-1)^2 = 2 \not\geq 4 = (-1-1)^2</math>, so the triangle inequality is not satisfied.
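
A minimal sketch in Python (the function names are illustrative, not part of any particular library) that implements these two cost functions and verifies the counterexample to the triangle inequality given above:

<syntaxhighlight lang="python">
def quadratic_cost(u, v):
    """Quadratic cost, as used in least squares linear regression."""
    return (u - v) ** 2

def absolute_cost(u, v):
    """Absolute-value cost, as used in least absolute deviations regression."""
    return abs(u - v)

# C(u, u) = 0 for all u
assert quadratic_cost(3.5, 3.5) == 0
assert absolute_cost(-2.0, -2.0) == 0

# The triangle inequality can fail for the quadratic cost:
# C(-1, 0) + C(0, 1) = 1 + 1 = 2, but C(-1, 1) = 4.
assert quadratic_cost(-1, 0) + quadratic_cost(0, 1) < quadratic_cost(-1, 1)
</syntaxhighlight>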
 
===Binary classification problems===
 
For binary classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function <math>C\colon [0,1]\times \{0,1\} \to \mathbb R</math> of two variables <math>p\in [0,1]</math> and <math>v \in \{0,1\}</math> (the predicted probability and actual value) satisfying the following conditions:
 
* <math>C(1,1) = 0</math>
* <math>C(0,0) = 0</math>
* For <math>p \le q</math>, we have <math>C(q,1) \le C(p,1)</math>
* For <math>p \le q</math>, we have <math>C(p,0) \le C(q,0)</math>


Although not a strict requirement, cost functions should be chosen to be [[calculus:proper scoring rule|proper scoring rule]]s, so that accurate probability estimates are penalized less than inaccurate ones.
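
For instance, the logarithmic cost function <math>C(p,1) = -\log p</math>, <math>C(p,0) = -\log(1-p)</math> is proper: if the actual value equals 1 with probability <math>q</math>, the expected cost <math>-q \log p - (1-q)\log(1-p)</math> of predicting <math>p</math> is minimized exactly at <math>p = q</math>, so the prediction that minimizes expected cost is the true probability.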


Typical cost functions used for binary classification problems include the [[calculus:logarithmic scoring rule|logarithmic cost function]] and the [[quadratic cost function]], both of which are proper scoring rules. Of these, the logarithmic cost function is the more widely used, because it is the only [[calculus:logarithmic scoring rule is the only proper scoring rule up to affine transformations in case of more than two classes|proper scoring rule when considering more than two classes]].
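
As a minimal sketch (the function names are illustrative), both cost functions can be written and checked against the conditions above as follows:

<syntaxhighlight lang="python">
import math

def logarithmic_cost(p, v):
    """Logarithmic cost of predicting probability p when the actual value is v (0 or 1)."""
    return -math.log(p) if v == 1 else -math.log(1 - p)

def quadratic_cost(p, v):
    """Quadratic (Brier-style) cost of predicting probability p when the actual value is v."""
    return (p - v) ** 2

# C(1, 1) = 0 and C(0, 0) = 0
assert quadratic_cost(1, 1) == 0 and quadratic_cost(0, 0) == 0
assert logarithmic_cost(1, 1) == 0 and logarithmic_cost(0, 0) == 0

# Monotonicity: for p <= q, C(q, 1) <= C(p, 1) and C(p, 0) <= C(q, 0)
p, q = 0.3, 0.8
assert logarithmic_cost(q, 1) <= logarithmic_cost(p, 1)
assert logarithmic_cost(p, 0) <= logarithmic_cost(q, 0)
</syntaxhighlight>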


==Combining the cost function values across multiple data points==


To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices (illustrated in the sketch after the list):


* Arithmetic mean: This is the most common choice and the default. It is equivalent to using the sum of the cost functions, but using the mean instead of the sum is preferable because it allows us to directly compare cost function values for data sets of different sizes.
* Mean using <math>r^{\text{th}}</math> powers, for some <math>r > 1</math>: We take the mean of the <math>r^{\text{th}}</math> powers of all the cost functions, then take the <math>r^{\text{th}}</math> root.
* Maximum value: We take the largest of the cost function values across all the data points (the worst-case error).
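
A minimal sketch of these three averaging schemes, assuming the per-data-point cost values have already been computed (the names and signatures are illustrative):

<syntaxhighlight lang="python">
def aggregate_mean(costs):
    """Arithmetic mean of the per-data-point costs."""
    return sum(costs) / len(costs)

def aggregate_power_mean(costs, r):
    """Mean of the r-th powers of the costs, followed by an r-th root (r > 1)."""
    return (sum(c ** r for c in costs) / len(costs)) ** (1 / r)

def aggregate_max(costs):
    """Maximum (worst-case) cost across the data points."""
    return max(costs)

costs = [0.5, 2.0, 1.0, 0.25]
print(aggregate_mean(costs))           # 0.9375
print(aggregate_power_mean(costs, 2))  # root mean square of the costs
print(aggregate_max(costs))            # 2.0
</syntaxhighlight>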


Note that there is some flexibility in terms of how we divide the load between the choice of cost function and the choice of averaging function: for instance, using the absolute value cost function with the root mean square averaging process is essentially equivalent to using the squared error cost function with the arithmetic mean averaging process, in the sense that the overall cost functions obtained in the two cases differ only by a monotone transformation (a square root). If, however, we are considering adding regularization terms, then the distinction between the cost functions matters.
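
A quick numerical check of this equivalence (the data values below are made up for illustration):

<syntaxhighlight lang="python">
import math

# The r = 2 power mean of absolute errors equals the square root of the
# arithmetic mean of squared errors, so the two overall costs differ only
# by a monotone transformation.
predicted = [1.0, 2.5, -0.5]
actual    = [0.8, 3.0,  0.0]

abs_costs = [abs(u - v) for u, v in zip(predicted, actual)]
sq_costs  = [(u - v) ** 2 for u, v in zip(predicted, actual)]

rms_of_abs = (sum(c ** 2 for c in abs_costs) / len(abs_costs)) ** 0.5
mean_of_sq = sum(sq_costs) / len(sq_costs)

assert math.isclose(rms_of_abs, mean_of_sq ** 0.5)
</syntaxhighlight>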
