Logistic regression
Summary
| Item | Value |
|---|---|
| Type of variable predicted | Binary (yes/no) |
| Format of prediction | Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers. |
| Functional form of model | Computes the probability by applying the logistic function to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a generalized linear model. |
| Typical cost function | As with most probabilistic binary prediction models, logistic regression models are typically scored using the logarithmic cost function. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the squared error is computed *after* applying the logistic function. |
| Typical regularization choices | Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common. |
| Learning algorithms | See here for more (to eventually fill in here). |
Definition
The term logistic regression refers both to a model for predicting binary outputs and to the process of finding the parameters of that model. It is therefore better viewed as solving a classification problem than a regression problem. However, because the model shares many basic components with linear regression, and is an example of a generalized linear model, it has historically gone by the name of logistic regression.
The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called features). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid), this probability can be converted into a yes/no prediction.
The probability is computed as follows:
Probability = logistic function evaluated at (linear combination of features with initially unknown parameters)
The logistic function is the function:

<math>x \mapsto \frac{1}{1 + e^{-x}}</math>
The values of the unknown parameters are determined empirically so as to best fit the training set.
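To make this concrete, here is a minimal Python sketch of the functional form; the feature values and parameter values are made up purely for illustration:

```python
import numpy as np

def logistic(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical feature and parameter values, for illustration only.
features = np.array([1.0, 2.5, -0.3])  # first entry is a constant intercept feature
theta = np.array([-1.0, 0.8, 1.5])     # made-up parameter values

probability = logistic(features @ theta)  # logistic of a linear combination
print(probability)          # predicted probability of "yes"
print(probability >= 0.5)   # yes/no prediction using threshold 0.5
```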
Cost function used
The typical cost function used is the logarithmic cost function (also known as logarithmic scoring): if the predicted probability of the event is <math>p</math>, this assigns a score of <math>-\ln p</math> if the event happened and a score of <math>-\ln(1 - p)</math> if the event did not happen. The lower the score, the better. The logarithmic scoring rule is proper: if the true probability of the event is <math>p</math>, then the expected score is minimized by predicting <math>p</math>.
Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.
The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.
There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.
Closed form expression for cost function using 0,1-encoding
Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with the prediction is:

<math>-(y \ln p + (1 - y) \ln(1 - p))</math>
Suppose there are <math>m</math> data points. The label vector is the vector <math>\vec{y} = (y_1, y_2, \dots, y_m)</math> and the predicted probability vector is the vector <math>\vec{p} = (p_1, p_2, \dots, p_m)</math>. The cost function is:

<math>-\frac{1}{m} \sum_{i=1}^m \left( y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right)</math>
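A minimal numpy sketch of this cost function, with made-up labels and predicted probabilities:

```python
import numpy as np

def log_cost(y, p):
    """Average logarithmic cost for 0/1 labels y and predicted probabilities p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # hypothetical 0/1 labels
p = np.array([0.9, 0.2, 0.6, 0.8])  # hypothetical predicted probabilities
print(log_cost(y, p))
```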
Closed form expression for cost function using -1,1-encoding
Suppose we assign a label <math>z</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability and <math>x = \ln\frac{p}{1 - p}</math> is the corresponding log-odds, the score associated with the prediction is:

<math>\ln(1 + e^{-zx})</math>
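As a quick numeric check that this agrees with the 0,1-encoding (values made up for illustration):

```python
import numpy as np

def score_pm1(z, x):
    """Score for a label z in {-1, 1} and predicted log-odds x."""
    return np.log(1 + np.exp(-z * x))

p = 0.8                  # hypothetical predicted probability
x = np.log(p / (1 - p))  # corresponding log-odds
print(score_pm1(1, x), -np.log(p))       # agree: event occurred
print(score_pm1(-1, x), -np.log(1 - p))  # agree: event did not occur
```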
Description as a generalized linear model
The logistic regression model can be viewed as a special case of the generalized linear model, namely the case where the link function is the log-odds (logit) function, whose inverse is the logistic function, and where the cost function is the logarithmic cost function.
The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>p \mapsto \ln\frac{p}{1 - p}</math>
Therefore, the logistic regression problem can be viewed as a linear regression problem:
Log-odds function = Linear combination of features with unknown parameters
However, the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.
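As a quick numeric check that the log-odds function inverts the logistic function (the input value is made up):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_odds(p):
    """Inverse of the logistic function (the logit function)."""
    return np.log(p / (1 - p))

x = 0.7                       # hypothetical linear-combination value
print(log_odds(logistic(x)))  # recovers 0.7
```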
Computational format
The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.
Some notation:
- <math>m</math> denotes the number of examples (data points).
- <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
- <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is an <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
- The vector of labels (or actual outputs) is an <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\pm 1</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{z}</math>. We have the relations <math>z_i = 2y_i - 1</math> and <math>y_i = (z_i + 1)/2</math> for all <math>i</math>.
- The parameter vector is an <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.
The predicted probability vector <math>\vec{p}</math> is given as:

<math>\vec{p} = f(X\vec{\theta})</math>

where <math>f</math> is the logistic function and is applied coordinate-wise.
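Putting the notation together, here is a minimal sketch with made-up data (<math>m = 4</math> examples, <math>n = 3</math> features including a constant intercept feature), along with one step of gradient descent, a common learning algorithm for this cost function:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up data: m = 4 examples, n = 3 features (first column is a constant 1).
X = np.array([[1.0,  0.5, -1.2],
              [1.0,  1.3,  0.4],
              [1.0, -0.2,  0.8],
              [1.0,  2.1, -0.5]])
y = np.array([0, 1, 1, 0])          # m-dimensional 0/1 label vector
theta = np.array([0.1, 0.7, -0.4])  # n-dimensional parameter vector (made up)

p = logistic(X @ theta)             # predicted probability vector
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# One gradient-descent step; the gradient of the cost is X^T (p - y) / m.
learning_rate = 0.1
theta = theta - learning_rate * (X.T @ (p - y)) / len(y)
print(cost, theta)
```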
Relation with other forms of machine learning
Linear regression
Logistic regression and linear regression are related in the following ways:
| Aspect | How they're similar | How they're different |
|---|---|---|
| Generalized linear models, so linear dependence on inputs | Both are examples of generalized linear models | For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. For logistic regression, the link function is the log-odds (logit) function and the typical choice of cost function is the logarithmic cost function. |
| Prediction of continuous variables | Prima facie, both of them output variables that take continuous values | Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the probability associated with a binary classification problem. |
Support vector machines
Logistic regression and the support vector machine (SVM) method are related in the following ways:
| Aspect | How they're similar | How they're different |
|---|---|---|
| Binary classification | Both logistic regression and support vector machines are approaches to tackling binary classification. | Logistic regression outputs a probability, whereas support vector machines output a yes/no answer. Support vector machines can be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results. |
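As an illustration of this similarity, here is a sketch using scikit-learn (assuming it is installed); the data set is synthetic and made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic, linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

log_reg = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)

# The yes/no predictions of the two models largely agree on this data.
agreement = np.mean(log_reg.predict(X) == svm.predict(X))
print(agreement)
```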
Artificial neural networks
Artificial neural networks are a more complicated type of machine learning setup, capable of learning more complex functions. The individual units in an artificial neural network, called artificial neurons, can in principle compute any function, but a typical choice is to make each of them a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.
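For instance, a single artificial neuron of this kind is just a logistic regression unit; the inputs and weights below are made up:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One sigmoid neuron: the logistic function of a linear combination of inputs."""
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))

inputs = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
weights = np.array([0.3, 0.8, -0.5])  # hypothetical weights
print(neuron(inputs, weights, bias=0.1))
```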
Maximum entropy (MaxEnt) models
Maximum entropy models generalize logistic regression to classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship. (The need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case.)
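In the multiclass setting this takes the form of multinomial logistic regression, where the logistic function is replaced by the softmax function; a minimal sketch with made-up class scores:

```python
import numpy as np

def softmax(scores):
    """Class probabilities from one linear score per class."""
    e = np.exp(scores - np.max(scores))  # shift scores for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical linear scores for 3 classes
print(softmax(scores))              # probabilities that sum to 1
```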