Cost function: Difference between revisions

Revision as of 02:07, 9 May 2016

Definition

For a single piece of data

The cost function associated with a given machine learning problem is a function that takes as input the predicted function value and actual observed output and associates to them a number measuring how far the predicted value is from the observed value.

For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function $C\colon \mathbb {R} \times \mathbb {R} \to \mathbb {R}$ of two variables $u,v\in \mathbb {R}$ (the predicted value and actual value) satisfying the following conditions:
- $C(u,u)=0$ for all $u\in \mathbb {R}$
- For $u\leq v\leq w$ , we have $C(u,v)\leq C(u,w)$ and $C(v,w)\leq C(u,w)$
The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.
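As a rough illustration of the conditions above, the following sketch checks them numerically for two common regression cost functions. The function names, the helper `satisfies_regression_conditions`, and the sample points are assumptions made for this example and do not come from the article; a check on a finite set of points is only a sanity test, not a proof.

```python
from itertools import combinations_with_replacement


def squared_error(u, v):
    """Squared error cost: penalizes larger errors superlinearly."""
    return (u - v) ** 2


def absolute_error(u, v):
    """Absolute error cost: penalizes errors linearly."""
    return abs(u - v)


def satisfies_regression_conditions(cost, points):
    """Check both regression conditions on a finite set of sample points."""
    # Condition 1: C(u, u) = 0 for every sampled u.
    if any(cost(u, u) != 0 for u in points):
        return False
    # Condition 2: for u <= v <= w, C(u, v) <= C(u, w) and C(v, w) <= C(u, w).
    # Drawing triples from the sorted points yields them in nondecreasing order.
    for u, v, w in combinations_with_replacement(sorted(points), 3):
        if cost(u, v) > cost(u, w) or cost(v, w) > cost(u, w):
            return False
    return True


# Hypothetical sample points, chosen so the arithmetic is exact in floating point.
SAMPLE_POINTS = [-2.0, -0.5, 0.0, 1.0, 3.0]

print(satisfies_regression_conditions(squared_error, SAMPLE_POINTS))
print(satisfies_regression_conditions(absolute_error, SAMPLE_POINTS))

# Squared error fails the triangle inequality: C(0, 2) exceeds C(0, 1) + C(1, 2).
print(squared_error(0.0, 2.0), squared_error(0.0, 1.0) + squared_error(1.0, 2.0))
```

The last line compares $C(0,2)=4$ with $C(0,1)+C(1,2)=2$, a concrete case where the squared error cost violates the triangle inequality.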

For classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function $C\colon [0,1]\times \{0,1\}\to \mathbb {R}$ of two variables $p\in [0,1]$ and $v\in \{0,1\}$ (the predicted probability and actual value) satisfying the following conditions:
- $C(1,1)=0$
- $C(0,0)=0$
- For $p\leq q$ , we have $C(q,1)\leq C(p,1)$
- For $p\leq q$ , we have $C(p,0)\leq C(q,0)$
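The four conditions above can be checked numerically in the same spirit. The sketch below uses two candidate cost functions, a squared (Brier-style) cost and a logarithmic cost; both names, the helper `satisfies_classification_conditions`, and the probability grid are assumptions for illustration only. Note that the logarithmic cost is infinite when a certain prediction is wrong, so strictly speaking it takes values outside $\mathbb {R}$ at those two points.

```python
import math


def brier_cost(p, v):
    """Squared distance between the predicted probability and the 0/1 outcome."""
    return (p - v) ** 2


def log_cost(p, v):
    """Negative log of the probability assigned to the observed outcome."""
    likelihood = p if v == 1 else 1.0 - p
    # A confident wrong prediction has zero likelihood and infinite cost.
    return math.inf if likelihood == 0 else -math.log(likelihood)


def satisfies_classification_conditions(cost, grid):
    """Check the four classification conditions on a finite probability grid."""
    # C(1, 1) = 0 and C(0, 0) = 0.
    if cost(1.0, 1) != 0 or cost(0.0, 0) != 0:
        return False
    ordered = sorted(grid)
    # Checking neighbouring pairs p <= q is enough, since <= is transitive.
    for p, q in zip(ordered, ordered[1:]):
        if cost(q, 1) > cost(p, 1) or cost(p, 0) > cost(q, 0):
            return False
    return True


# Hypothetical grid of predicted probabilities.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]

print(satisfies_classification_conditions(brier_cost, GRID))
print(satisfies_classification_conditions(log_cost, GRID))
```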

For several data points

To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices:

- Arithmetic mean: This is the most common choice, and the default specification. It is equivalent to using the sum of the cost functions, but the mean is preferable to the sum because it allows us to directly compare cost function values for data sets of different sizes.
- Mean using $r^{th}$ powers, for some $r>1$ : We take the mean of the $r^{th}$ powers of all the cost functions, then take the $r^{th}$ root.
- Maximum value: We take the largest of the cost function values. For nonnegative costs, this is the limit of the $r^{th}$ power mean as $r\to \infty$.
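The three averaging choices can be sketched as follows. This is a minimal example that assumes the per-point costs are nonnegative and already computed; the function names and the sample costs are hypothetical.

```python
def arithmetic_mean(costs):
    """Plain average of the per-point costs."""
    return sum(costs) / len(costs)


def power_mean(costs, r):
    """Mean of the r-th powers of the costs, followed by the r-th root.

    Assumes every cost is nonnegative and r > 0.
    """
    return (sum(c ** r for c in costs) / len(costs)) ** (1.0 / r)


def maximum_value(costs):
    """Largest per-point cost."""
    return max(costs)


# Hypothetical per-point costs for three data points.
costs = [1.0, 2.0, 3.0]

print(arithmetic_mean(costs))
print(power_mean(costs, 2))
print(maximum_value(costs))
```

For nonnegative costs the three values are ordered: the arithmetic mean is at most the power mean with $r>1$, which is at most the maximum. Larger $r$ therefore gives more weight to the worst-predicted data points.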

Note that there is some flexibility in how we divide the load between the choice of cost function and the choice of averaging function. For instance, using the absolute value cost function with the root mean square averaging process gives the square root of what we get from the squared error cost function with the arithmetic mean averaging process. The overall cost functions obtained in the two cases are therefore equivalent under a monotone transformation, and they have the same minimizers. If, however, we are considering adding regularization terms, then the distinction between the cost functions matters, since the monotone transformation changes how the cost trades off against the added term.
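To spell out the example, here is a short worked identity. The notation is an assumption for illustration: $u_i$ and $v_i$ denote the predicted and actual values for the $i^{th}$ of $n$ data points, $\theta$ the parameters being fitted, and $\lambda R(\theta)$ a generic regularization term.

```latex
% Absolute value cost with root mean square averaging
% equals the square root of squared error cost with arithmetic mean averaging:
\left( \frac{1}{n} \sum_{i=1}^{n} |u_i - v_i|^{2} \right)^{1/2}
  = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (u_i - v_i)^{2} }

% With a regularization term the two objectives are no longer
% monotone transformations of each other, so in general
\arg\min_{\theta} \left( \sqrt{\mathrm{MSE}(\theta)} + \lambda R(\theta) \right)
  \neq
\arg\min_{\theta} \left( \mathrm{MSE}(\theta) + \lambda R(\theta) \right)
```

Since the square root is strictly increasing on nonnegative numbers, the two sides of the first identity are minimized by the same predictions. The second line is the reason the choice matters once a regularization term is added.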

@@ Line 9: / Line 9: @@
    <ul>
      <li><math>C(u,u) = 0</math> for all <math>u \in \R</math></li>
-     <li>For <math>u \le v \le w</math>, <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math></li>
+     <li>For <math>u \le v \le w</math>, we have <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math></li>
    </ul> The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.
 </ul>
@@ Line 16: / Line 16: @@
 ** <math>C(1,1) = 0</math>
 ** <math>C(0,0) = 0</math>
-** For <math>p \le q</math>, <math>C(q,1) \le C(p,1)</math>
+** For <math>p \le q</math>, we have <math>C(q,1) \le C(p,1)</math>
-** For <math>p \le q</math>, <math>C(p,0) \le C(q,0)</math>
+** For <math>p \le q</math>, we have <math>C(p,0) \le C(q,0)</math>
 ===For several data points===