Machinelearning - User contributions [en]

MediaWiki:Sitenotice

2024-10-06T17:46:27Z

Vipul:

Want site search autocompletion? See [[Project:Enabling site search autocompletion|here]] 
Encountering 429 Too Many Requests errors when browsing the site? See [[Project:429 Too Many Requests error|here]]

User:Vipul/Sandbox

2024-10-06T17:45:53Z

Vipul:

* <math>(7 + 2)^\sqrt{9} = 729</math>
* <math>5^{1 + 2} = 125</math>
* <math>6^{2 + 1} = 216</math>
* <math>(3 + 4)^3 = 343</math>
* <math>(4/2)^{10} = 1024</math>

MediaWiki:Sitenotice

2024-09-30T01:20:22Z

Vipul:

'''This site is in the process of being migrated to a new server. Edits made until this notice has been removed may be lost.''' 
Want site search autocompletion? See [[Project:Enabling site search autocompletion|here]] 
Encountering 429 Too Many Requests errors when browsing the site? See [[Project:429 Too Many Requests error|here]]

User:Vipul/Sandbox

2024-09-06T00:35:53Z

Vipul:

* <math>(7 + 2)^\sqrt{9} = 729</math>
* <math>5^{1 + 2} = 125</math>
* <math>6^{2 + 1} = 216</math>
* <math>(3 + 4)^3 = 343</math>

MediaWiki:Sitenotice

2024-09-06T00:31:25Z

Vipul:

Machinelearning:Enabling site search autocompletion

2024-09-06T00:30:42Z

Vipul: Created page with "Content copied from Ref:Ref:Enabling site search autocompletion. Images used are specific to this site (Machinelearning). Site search autocompletion is currently broken by default on this site. This page includes details on how to get it to work, and what's going on. ==What's wrong with site search autocompletion and how to fix it== ===What's wrong=== When you start typing something in the site search bar, you'll see it stuck at "Loading search suggestions" as sh..."

Content copied from [[Ref:Ref:Enabling site search autocompletion]]. Images used are specific to this site (Machinelearning).

Site search autocompletion is currently broken by default on this site. This page includes details on how to get it to work, and what's going on.

==What's wrong with site search autocompletion and how to fix it==

===What's wrong===

When you start typing something in the site search bar, you'll see it stuck at "Loading search suggestions" as shown in the screenshot below:

[[File:Site search autocompletion broken.png]]

Note that the actual search is still working -- you just have to hit Enter after typing the search query and it'll go to the search results page. It's the autocompletion before you hit Enter that is broken.

===How to fix it===

To fix it, you need to follow these steps:

* Write to vipulnaik1@gmail.com asking for a login to the site. Please include the following with your request: preferred username, preferred initial password (you can change it after logging in), real name (if you want it entered), email address to use (if you want an actual email address by which you can be contacted), and whether you want edit access as well. You don't need edit access for enabling site search autocompletion.
* Log in to the site. Then go to [[Special:Preferences]]. Go to the Appearance section and switch the Skin from "Vector (2022)" to "Vector legacy (2010)".
* Make sure to hit "Save" at the bottom.
* Now you can reload the page or load a new page.

Site search autocompletion should now work. Here's an example:

[[File:Site search autocompletion working.png]]

==More background==

We've recently upgraded the MediaWiki version of this wiki from 1.35.13 to 1.41.2 (see [[Special:Version]]). The upgrade allows us to migrate the wiki to a more modern operating system version running PHP 8. With the current setup for MediaWiki 1.41.2, we're in this situation:

* The "Vector legacy (2010)" skin has site search autocompletion working, but it doesn't render well on small screens. Specifically, even on small mobile screens, it still shows the left menu, and doesn't properly use the MobileFrontend extension settings.
* The "Vector (2022)" skin doesn't have site search autocompletion working (see screenshots in preceding section) but it does render fine on mobile devices.

It is possible to set only one default skin (that is applicable to all non-logged-in users and is the default for logged-in users who have not configured a skin for themselves). So, the selection of default skin comes down to whether it's more important for casual users to have the mobile experience working or to have site search autocompletion working. Based on a general understanding of user behavior, we believe that having a usable mobile experience is more important for casual users than having site search autocompletion.

However, for power users who are using the site extensively, site search autocompletion may be important. That's why we've written this page giving guidance on how to set up site search autocompletion.

File:Site search autocompletion working.png

2024-09-06T00:30:33Z

Vipul:

File:Site search autocompletion broken.png

2024-09-06T00:30:12Z

Vipul:

Machinelearning:429 Too Many Requests error

2024-09-06T00:27:24Z

Vipul: Created page with "This content is copied from Ref:Ref:429 Too Many Requests error. If you get a 429 Too Many Requests error when browsing this site, read on. You're probably seeing this error because a large number of requests have been made from your IP address over a short period of time. That's probably a lot of requests from you or others who share your IP address (such as your home wi-fi network). Waiting a minute and then retrying should generally work. If you are an actual h..."

This content is copied from [[Ref:Ref:429 Too Many Requests error]].

If you get a 429 Too Many Requests error when browsing this site, read on.

You're probably seeing this error because a large number of requests have been made from your IP address over a short period of time. That's probably a lot of requests from you or others who share your IP address (such as your home wi-fi network). Waiting a minute and then retrying should generally work.

If you are an actual human being with a legitimate reason to be browsing the site heavily, first, thank you and sorry about this! We set rate limits to prevent bots, spiders, spammers, and malicious actors from consuming too much of our server's resources so that our server's resources can be devoted to real humans like you. Consider writing to vipulnaik1@gmail.com with your IP address to have the IP address whitelisted. You can get your IP address by [https://www.google.com/search?q=my+ip+address Googling "my IP address"] (scroll down a little bit to where Google includes the IP address in a box). NOTE: If you have both an IPv4 address and an IPv6 address, you should send both; the server supports both IPv4 and IPv6, so either may end up getting used. To check if you have an IPv6 address, try visiting [https://ipv6.google.com/ ipv6.google.com].

If your IP address changes, or you are away from your home network, then you'll get rate-limited again. So if you find yourself getting rate-limited after already having been whitelisted, check if you are on a different IP address than the one for which you requested whitelisting.

MediaWiki:Sitenotice

2024-09-06T00:25:50Z

Vipul: Blanked the page

User:Vipul/Sandbox

2024-09-06T00:24:04Z

Vipul:

* <math>(7 + 2)^\sqrt{9} = 729</math>
* <math>5^{1 + 2} = 125</math>
* <math>6^{2 + 1} = 216</math>

User:Vipul/Sandbox

2024-09-06T00:20:10Z

Vipul:

* <math>(7 + 2)^\sqrt{9} = 729</math>
* <math>5^{1 + 2} = 125</math>

User:Vipul/Sandbox

2024-09-06T00:12:52Z

Vipul:

* <math>(7 + 2)^\sqrt{9} = 729</math>

MediaWiki:Sitenotice

2024-09-05T23:59:58Z

Vipul: Created page with "'''This wiki is in the process of being upgraded. The site may go down intermittently. Please try to avoid editing until this notice has been removed.'''"

'''This wiki is in the process of being upgraded. The site may go down intermittently. Please try to avoid editing until this notice has been removed.'''

User:Vipul/Sandbox

2024-07-06T04:52:14Z

Vipul:

User:Vipul/Sandbox

2024-05-05T20:03:57Z

Vipul:

Artificial neural network

2021-06-12T03:00:03Z

Vipul: /* Topology of the neural network */

==Definition==

An '''artificial neural network''' is a collection of nodes, each of which is an [[artificial neuron]], as well as directed edges between the nodes, called connections, each of which has a weight.

The output of each artificial neuron flows out from it on each of the connections ''from'' that node and become an input to the node the connection goes ''to''. The weight applied to that input is the weight on the connection. Each node also has an associated bias term that is the weight applied to the input 1 (not corresponding to any connection).

Nodes that have no connections going to them represent the inputs to the neural network; these correspond to the inputs or features for the problem being modeled through the neural network. Nodes that have no connections going out from them represent the outputs of the neural network, these correspond to the labels we are trying to predict.

==Things to decide in a neural network==

Let's say we want to design a neural network with <math>m</math> inputs and <math>n</math> outputs being predicted. There are a few main things to decide about the neural network.

=== Activation function of the neural network ===

We need to select the activation function that will be used for the artificial neurons. In principle, a different activation function could be used for each neuron; in practice neural networks choose a single activation function across all neurons.

=== Topology of the neural network ===

The topology of the neural network is the diagram of the nodes and their connections -- but without information on the weights and the activation function used. This topology might include connections for which we ultimately decide to set a weight of zero (and that we can therefore remove).

{| class="sortable" border="1"
! Type of network !! Description of topology !! Implications for learning
|-
| Single-layer neural network (no hidden layers)|| There are no nodes other than the input and output nodes. The input nodes feed directly into the output nodes. || This basically involves just doing machine learning on the activation function. For instance, if the activation function is a logistic function, it means running a [[logistic regression]] separately on each of the outputs.
|-
| Single-hidden-layer (feedforward) neural network || There is no direct connection between input and output nodes. There is a single hidden layer of neurons; all connections are either from inputs to the hidden layer or from the hidden layer to the outputs. In the "fully connected" version, ''all'' the permitted connections exist. || Backpropagation becomes serious business here.
|}

Information about the problem domain may go into the design of the topology of the neural network.

=== Weights of the neural network (including bias terms) ===

The weights of the neural network (including weights on connections as well as bias terms) are the most detailed and specific part of the description of the neural network.

These weights are generally learned by running a gradient descent / backpropagation algorithm on training data.

Artificial neural network

2021-06-12T02:59:45Z

Vipul: /* Things to decide in a neural network */

==Definition==

An '''artificial neural network''' is a collection of nodes, each of which is an [[artificial neuron]], as well as directed edges between the nodes, called connections, each of which has a weight.

The output of each artificial neuron flows out from it on each of the connections ''from'' that node and become an input to the node the connection goes ''to''. The weight applied to that input is the weight on the connection. Each node also has an associated bias term that is the weight applied to the input 1 (not corresponding to any connection).

Nodes that have no connections going to them represent the inputs to the neural network; these correspond to the inputs or features for the problem being modeled through the neural network. Nodes that have no connections going out from them represent the outputs of the neural network, these correspond to the labels we are trying to predict.

==Things to decide in a neural network==

Let's say we want to design a neural network with <math>m</math> inputs and <math>n</math> outputs being predicted. There are a few main things to decide about the neural network.

=== Activation function of the neural network ===

We need to select the activation function that will be used for the artificial neurons. In principle, a different activation function could be used for each neuron; in practice neural networks choose a single activation function across all neurons.

=== Topology of the neural network ===

The topology of the neural network is the diagram of the nodes and their connections -- but without information on the weights and the activation function used. This topology might include connections for which we ultimately decide to set a weight of zero (and that we can therefore remove).

{| class="sortable" border="1"
! Type of network !! Description of topology !! Implications for learning
|-
| Single-layer neural network (no hidden layers)|| There are no nodes other than the input and output nodes. The input nodes feed directly into the output nodes. || This basically involves just doing machine learning on the activation function. For instance, if the activation function is a logistic function, it means running a [[logistic regression]] separately on each of the outputs.
|-
| Single-hidden-layer (feedforward) neural network || There is no direct connection between input and output nodes. There is a single hidden layer of neurons; all connections are either from inputs to the hidden layer or from the hidden layer to the outputs. In the "fully connected" version, ''all'' the permitted connections exist.
|}

Information about the problem domain may go into the design of the topology of the neural network.

=== Weights of the neural network (including bias terms) ===

The weights of the neural network (including weights on connections as well as bias terms) are the most detailed and specific part of the description of the neural network.

These weights are generally learned by running a gradient descent / backpropagation algorithm on training data.

Artificial neural network

2021-06-12T02:59:34Z

Vipul: Created page with "==Definition== An '''artificial neural network''' is a collection of nodes, each of which is an artificial neuron, as well as directed edges between the nodes, called con..."

==Definition==

An '''artificial neural network''' is a collection of nodes, each of which is an [[artificial neuron]], as well as directed edges between the nodes, called connections, each of which has a weight.

The output of each artificial neuron flows out from it on each of the connections ''from'' that node and become an input to the node the connection goes ''to''. The weight applied to that input is the weight on the connection. Each node also has an associated bias term that is the weight applied to the input 1 (not corresponding to any connection).

Nodes that have no connections going to them represent the inputs to the neural network; these correspond to the inputs or features for the problem being modeled through the neural network. Nodes that have no connections going out from them represent the outputs of the neural network, these correspond to the labels we are trying to predict.

==Things to decide in a neural network==

Let's say we want to design a neural network with <math>m<math> inputs and <math>n</math> outputs being predicted. There are a few main things to decide about the neural network.

=== Activation function of the neural network ===

We need to select the activation function that will be used for the artificial neurons. In principle, a different activation function could be used for each neuron; in practice neural networks choose a single activation function across all neurons.

=== Topology of the neural network ===

The topology of the neural network is the diagram of the nodes and their connections -- but without information on the weights and the activation function used. This topology might include connections for which we ultimately decide to set a weight of zero (and that we can therefore remove).

{| class="sortable" border="1"
! Type of network !! Description of topology !! Implications for learning
|-
| Single-layer neural network (no hidden layers)|| There are no nodes other than the input and output nodes. The input nodes feed directly into the output nodes. || This basically involves just doing machine learning on the activation function. For instance, if the activation function is a logistic function, it means running a [[logistic regression]] separately on each of the outputs.
|-
| Single-hidden-layer (feedforward) neural network || There is no direct connection between input and output nodes. There is a single hidden layer of neurons; all connections are either from inputs to the hidden layer or from the hidden layer to the outputs. In the "fully connected" version, ''all'' the permitted connections exist.
|}

Information about the problem domain may go into the design of the topology of the neural network.

=== Weights of the neural network (including bias terms) ===

The weights of the neural network (including weights on connections as well as bias terms) are the most detailed and specific part of the description of the neural network.

These weights are generally learned by running a gradient descent / backpropagation algorithm on training data.

Artificial neuron

2021-06-12T02:34:02Z

Vipul: /* Common choices of activation function */

==Definition==

An '''artificial neuron''' is a function of the form:

<math>y = \varphi\left(\sum_{j=0}^m w_j x_j\right)</math>

where <math>w_j</math> are the weights on the neuron and <math>\varphi</math> is the activation function. Artificial neurons form components of [[artificial neural network]]s: an artificial neural network is obtained by composing and combining artificial neurons (i.e., using the outputs of some neurons as inputs for other neurons).

Generally, in [[machine learning]] problems, the topology of the artificial neural network, as well as the choice of activation function for each neuron, are fixed in advance. The values of the weights are discovered using the [[training set]] by minimizing an appropriately chosen cost function.

===Bias term===

Generally, the variable <math>x_0</math> is always taken to be <math>+1</math>, and called the ''bias term''. The weight <math>w_0</math> is the bias weight.

===Purpose of the weights===

The purpose of the weights is to combine the inputs in a way that extracts some information from all of them.

===Purpose of the activation function===

The purpose of the activation function is to rescale in a manner that extracts the relevant valuable information from the linear combination. In general, the activation function tends to squish the domain down to a smaller subset. The idea is that the goal of the neuron is closer to a classification problem than a problem of finding an exact magnitude, so very large values should get squished down to the same value as intermediate values.

For instance, suppose a self-driving car is trying to determine whether a particular segment of the picture frame represents paved road or a sidewalk. The degree of certainty that the picture is of paved road can be described by a probability that can range from 0 to 1. We may compute this probability using a [[logistic regression]] problem: we combine a lot of different pieces of information about the picture frame to compute a real number describing the log-odds of it being paved road, then apply the logistic function to compute the probability. Here, the logistic function plays the role of the activation function.

For an artificial neural network to have some power beyond a single artificial neuron, we ''must'' have a nonlinear activation function, because composing linear functions just gives us a linear function.

=== Common choices of activation function ===

{| class="sortable" border="1"
! Name of artificial neuron type !! Choice of activation function !! Mathematical form !! More information
|-
| Linear threshold unit or McCulloch-Pitts neuron || Heaviside step function (zero if less than a threshold, one if above the threshold) || for threshold <math>\theta</math>: 0 if <math>\sum_{j=0}^m w_j x_j < \theta</math>, 1 if <math>\sum_{j=0}^m w_j x_j < \theta</math>, 1/2 if <math>\sum_{j=0}^m w_j x_j = \theta</math> || This is not continuous at the threshold <math>\theta</math>; geometrically the region of discontinuity is a hyperplane. Linear threshold units are good for implementing boolean functions.
|-
| Logistic neuron || [[calculus:Logistic function|logistic function]] || <math>g\left(\sum_{j=0}^m w_j x_j = \theta \right)</math> where <math>g</math> is the logistic function <math>g(t) = \frac{1}{1 + e^{-t}}</math> || An artificial neural network with just one logistic neuron is equivalent to [[logistic regression]]. The continuity and in fact infinite differentiability of the logistic function makes it amenable to gradient descent / backpropagation methods. Artificial neural networks where all neurons are logistic neurons are commonly used in practice.
|}

Artificial neuron

2021-06-12T02:33:26Z

Vipul: /* Purpose of the activation function */

==Definition==

An '''artificial neuron''' is a function of the form:

<math>y = \varphi\left(\sum_{j=0}^m w_j x_j\right)</math>

where <math>w_j</math> are the weights on the neuron and <math>\varphi</math> is the activation function. Artificial neurons form components of [[artificial neural network]]s: an artificial neural network is obtained by composing and combining artificial neurons (i.e., using the outputs of some neurons as inputs for other neurons).

Generally, in [[machine learning]] problems, the topology of the artificial neural network, as well as the choice of activation function for each neuron, are fixed in advance. The values of the weights are discovered using the [[training set]] by minimizing an appropriately chosen cost function.

===Bias term===

Generally, the variable <math>x_0</math> is always taken to be <math>+1</math>, and called the ''bias term''. The weight <math>w_0</math> is the bias weight.

===Purpose of the weights===

The purpose of the weights is to combine the inputs in a way that extracts some information from all of them.

===Purpose of the activation function===

The purpose of the activation function is to rescale in a manner that extracts the relevant valuable information from the linear combination. In general, the activation function tends to squish the domain down to a smaller subset. The idea is that the goal of the neuron is closer to a classification problem than a problem of finding an exact magnitude, so very large values should get squished down to the same value as intermediate values.

For instance, suppose a self-driving car is trying to determine whether a particular segment of the picture frame represents paved road or a sidewalk. The degree of certainty that the picture is of paved road can be described by a probability that can range from 0 to 1. We may compute this probability using a [[logistic regression]] problem: we combine a lot of different pieces of information about the picture frame to compute a real number describing the log-odds of it being paved road, then apply the logistic function to compute the probability. Here, the logistic function plays the role of the activation function.

For an artificial neural network to have some power beyond a single artificial neuron, we ''must'' have a nonlinear activation function, because composing linear functions just gives us a linear function.

=== Common choices of activation function ===

{| class="sortable" border="1"
! Name of artificial neuron type !! Choice of activation function !! Mathematical form !! More information
|-
| Linear threshold unit or McCulloch-Pitts neuron || Heaviside step function (zero if less than a threshold, one if above the threshold) || for threshold <math>\theta</math>: 0 if <math>\sum_{j=0}^m w_j x_j < \theta</math>, 1 if <math>\sum_{j=0}^m w_j x_j < \theta</math>, 1/2 if <math>\sum_{j=0}^m w_j x_j = \theta</math> || This is not continuous at the threshold <math>\theta</math>; geometrically the region of discontinuity is a hyperplane. Linear threshold units are good for implementing boolean functions.
|-
| Logistic neuron || [[calculus:Logistic function|logistic function]] || <math>g\left(\sum_{j=0}^m w_j x_j = \theta \right)</math> where <math>g</math> is the logistic function || An artificial neural network with just one logistic neuron is equivalent to [[logistic regression]]. The continuity and in fact infinite differentiability of the logistic function makes it amenable to gradient descent / backpropagation methods. Artificial neural networks where all neurons are logistic neurons are commonly used in practice.
|}

Artificial neuron

2021-06-12T01:56:51Z

Vipul:

==Definition==

An '''artificial neuron''' is a function of the form:

<math>y = \varphi\left(\sum_{j=0}^m w_j x_j\right)</math>

where <math>w_j</math> are the weights on the neuron and <math>\varphi</math> is the activation function. Artificial neurons form components of [[artificial neural network]]s: an artificial neural network is obtained by composing and combining artificial neurons (i.e., using the outputs of some neurons as inputs for other neurons).

Generally, in [[machine learning]] problems, the topology of the artificial neural network, as well as the choice of activation function for each neuron, are fixed in advance. The values of the weights are discovered using the [[training set]] by minimizing an appropriately chosen cost function.

===Bias term===

Generally, the variable <math>x_0</math> is always taken to be <math>+1</math>, and called the ''bias term''. The weight <math>w_0</math> is the bias weight.

===Purpose of the weights===

The purpose of the weights is to combine the inputs in a way that extracts some information from all of them.

===Purpose of the activation function===

The purpose of the activation function is to rescale in a manner that extracts the relevant valuable information from the linear combination. In general, the activation function tends to squish the domain down to a smaller subset. The idea is that the goal of the neuron is closer to a classification problem than a problem of finding an exact magnitude, so very large values should get squished down to the same value as intermediate values.

For instance, suppose a self-driving car is trying to determine whether a particular segment of the picture frame represents paved road or a sidewalk. The degree of certainty that the picture is of paved road can be described by a probability that can range from 0 to 1. We may compute this probability using a [[logistic regression]] problem: we combine a lot of different pieces of information about the picture frame to compute a real number describing the log-odds of it being paved road, then apply the logistic function to compute the probability. Here, the logistic function plays the role of the activation function.

A few other remarks:

* The logistic function is a fairly common choice of activation function, and the default [[artificial neural network]] architecture uses logistic functions at all artificial neurons, so we can view artificial neural networks as generalizations of [[logistic regression]].
* Activation functions such as the logistic function, and most others that are typically chosen, have the property that for generally nice inputs, they are likely to simulate some form of almost-binary logic, and the artificial neural network can be viewed as a slight fuzzification of what is essentially a Boolean circuit.
* For an artificial neural network to have some power beyond a single artificial neuron, we ''must'' have a nonlinear activation function, because composing linear functions just gives us a linear function.

Supervised learning

2021-06-12T01:50:47Z

Vipul: /* Steps of supervised learning */

==Definition==

The term '''supervised learning''' is used to describe a subclass of machine learning problems where we are provided with a set of labeled examples for training and can use that data to determine a function that would take any new example and predict the label for that. There are two types of supervised learning techiques, [[classification]] and [[regression]].

Here, the term "example" refers to the input part of the function (that can be used to make the prediction) and the term "label" refers to the output of the function (that needs to be predicted).

==Steps of supervised learning==

Supervised learning is a process that goes through several steps, which are presented here in a table.

{| class="sortable" border="1"
! Aspect !! What gets chosen here !! Description
|-
| [[Feature selection]] || The set of features that the model depends on. || Based on the problem domain, we come up with a list of relevant features that affect the output function. If we choose too few features, then the task might be theoretically impossible. For instance, if the only feature we have for a house is its area, and we need to predict the price, we cannot do the prediction too well. The more the features, the better our ability to predict in principle. However, too many features means more effort spent collecting their values, and there are also dangers of [[overfitting]].
|-
| [[Model class selection]] (not to be confused with [[hyperparameter optimization]]) || The functional form (with [[parameter]]s) describing how the output depends on the features. || This again depends on theoretical knowledge based on the problem domain, as well as empirical exploration of the data gathered.
|-
| [[Cost function selection]] || The cost function (or error function) used to measure error on new data. || This again depends on theoretical knowledge based on the problem domain, and also on the choice of model type. Often, the error function selection is bundled with the model class selection, because part of the model class selection process also includes identifying the nature of distribution of errors or anomalies. This has the property that if we choose parameters so that our predicted function matches the actual function precisely, the error is zero. In principle, however, the error function is independent of the model class, so we can essentially combine any permissible model class with any permissible error function.
|-
| Regularization-type choices || The choice of regularization function to add to the cost function when using on the training data. Requires choosing [[regularization hyperparameter]](s). ||
|-
| [[Learning algorithm]] applied on the training data || The values of the parameters (not uniquely determined, we might get a portfolio of choices for different hyperparameter choices)|| This is the algorithm that tries to solve the optimization function of choosing values of the parameters (for our chosen model) so that the error function is minimized (or close to minimized). Note that we are trying to minimize the error function for unknown inputs, but our algorithm is being trained on known inputs. Thus, there are issues of [[overfitting]]. This problem is addressed through a number of techniques, including [[regularization]] and [[early stopping]].
|-
| [[Cross-validation]] (relates to hyperparameter optimization) || The actual values of the hyperparameters and parameters || This includes techniques that tweak the [[hyperparameter]]s that control the performance of the learning algorithm. These could include learning rate parameters for gradient descent, or regularization parameters introduced to avoid overfitting.
|}

Machine learning

2021-06-12T01:49:06Z

Vipul: /* Classification based on nature of training data */

==Definition==

'''Machine learning''' can be defined as the problem of using partial data about a function or relation to make predictions about how the function or relation would perform on new data. It is the branch of AI that covers its statistical part.

===The case of learning a function===

Although this particular formulation is not the most natural formulation of all machine learning problems, any machine learning problem can be converted to this formulation.

An output function depends on some inputs (called [[feature]]s) (and possibly a random noise component). We have access to some training data, or thee ability to explore and discover training data, where we may or may not have the value of the output function, and we either already have or can get the values of the features. The machine learning problem asks for an explicit description of a function that would perform well on new inputs that we provide to it.

==Types of machine learning==

===Classification based on nature of training data===

{| class="sortable" border="1"
! Type of machine learning !! Description
|-
| [[supervised learning]] || Explicit training data (input-output pairs) are provided. These can be used to learn the parameters for the functional form that can then be used to make predictions.
|-
| [[unsupervised learning]] || The training data as provided includes a list of inputs but does not provide explicit outputs for explicit inputs. For instance, in the context of classification problems, an unsupervised learning problem would simply provide a lot of inputs without specifying the output values for those inputs. The job of the machine learning algorithm is to use the distribution of the inputs to figure out the output values.
|-
| [[semi-supervised learning]] || The training data is a mix of explicit input-output pairs and input-only data. Semi-supervised learning combines some of the aspects of supervised learning and unsupervised learning.
|-
| [[reinforcement learning]] || Unlike supervised learning, the training data is not chosen by the user; instead, the agent being trained generates the training data by interacting with the environment. In addition, the kind of feedback is different: in supervised learning the feedback is ''instructive'' (i.e. does not depend on the outputs the system selects), but in reinforcement learning the feedback is ''evaluative'' (i.e. depends on the action selected by the agent).<ref>Sutton and Barto. ''Reinforcement Learning: An Introduction'' (2nd ed). p. 25.</ref><ref>[https://www.coursera.org/learn/fundamentals-of-reinforcement-learning/lecture/PtVBs/sequential-decision-making-with-evaluative-feedback "Sequential Decision Making with Evaluative Feedback"]. Coursera.</ref>
|}

===Classification based on nature of prediction===

{| class="sortable" border="1"
! Type of prediction problem !! Description
|-
| binary classification problem || Here, the output has only two permissible values. Therefore, the prediction problem can be viewed as a yes/no or a true/false prediction problem. Models for binary classification can be probabilistic (such as [[logistic regression]] or [[artificial neural network]]s) or yes/no (such as [[support vector machine]]s). Note that probabilistic binary classification models structurally resemble regression models.
|-
| discrete variable prediction problem, or multi-class classification problem || Here, the output can take one of a finite number of values. We can also think of this as a classification problem where each input case needs to be sorted into one of finitely many classes.
|-
| regression problem, or continuous variable prediction problem || Here, the output can take a value over a continuous range of values.
|}

===Classification based on the stage of learning===

{| class="sortable" border="1"
! Stage of learning !! Description
|-
| [[eager learning]] || This is the more common form of learning. Here, the training data is pre-processed and an explicit and compact representation of the function is learned from it (usually, in the form of a paramater vector). The learning phase that uses the training data to learn the compact representation takes substantially greater time, but once this phase is done, the training data can be thrown away. The compact representation takes much less memory than the training data and can be used to quickly compute the output for any input.
|-
| [[lazy learning]] || With this form of learning, the training data is not completely pre-processed. Every time a new prediction needs to be made, the training data is used to make the prediction. Lazy learning algorithms are useful in cases where the prediction of function values is based on nearby values.
|-
| [[online learning]] || Here, the input data is streamed to the algorithm, one instance at a time. For the one-instance case (each data point) three steps occur: (a) The algorithm reads the data (the input only), (b) the algorithm makes a prediction of the output, (c) the algorithm learns the actual output and updates the parameters according to that.
|}

==References==

<references/>

Machine learning

2021-06-12T01:48:36Z

Vipul: /* The case of learning a function */

==Definition==

'''Machine learning''' can be defined as the problem of using partial data about a function or relation to make predictions about how the function or relation would perform on new data. It is the branch of AI that covers its statistical part.

===The case of learning a function===

Although this particular formulation is not the most natural formulation of all machine learning problems, any machine learning problem can be converted to this formulation.

An output function depends on some inputs (called [[feature]]s) (and possibly a random noise component). We have access to some training data, or thee ability to explore and discover training data, where we may or may not have the value of the output function, and we either already have or can get the values of the features. The machine learning problem asks for an explicit description of a function that would perform well on new inputs that we provide to it.

==Types of machine learning==

===Classification based on nature of training data===

{| class="sortable" border="1"
! Type of machine learning !! Description
|-
| [[supervised learning]] || Explicit training data (input-output pairs) are provided. These can be used to learn the parameters for the functional form that can then be used to make predictions.
|-
| [[unsupervised learning]] || The training data as provided does not provide explicit outputs for explicit inputs. For instance, in the context of classification problems, an unsupervised learning problem would simply provide a lot of inputs without specifying the output values for those inputs. The job of the machine learning algorithm is to use the distribution of the inputs to figure out the output values.
|-
| [[semi-supervised learning]] || The training data is a mix of explicit input-output pairs and input-only data. Semi-supervised learning combines some of the aspects of supervised learning and unsupervised learning.
|-
| [[reinforcement learning]] || Unlike supervised learning, the training data is not chosen by the user; instead, the agent being trained generates the training data by interacting with the environment. In addition, the kind of feedback is different: in supervised learning the feedback is ''instructive'' (i.e. does not depend on the outputs the system selects), but in reinforcement learning the feedback is ''evaluative'' (i.e. depends on the action selected by the agent).<ref>Sutton and Barto. ''Reinforcement Learning: An Introduction'' (2nd ed). p. 25.</ref><ref>[https://www.coursera.org/learn/fundamentals-of-reinforcement-learning/lecture/PtVBs/sequential-decision-making-with-evaluative-feedback "Sequential Decision Making with Evaluative Feedback"]. Coursera.</ref>
|}

===Classification based on nature of prediction===

{| class="sortable" border="1"
! Type of prediction problem !! Description
|-
| binary classification problem || Here, the output has only two permissible values. Therefore, the prediction problem can be viewed as a yes/no or a true/false prediction problem. Models for binary classification can be probabilistic (such as [[logistic regression]] or [[artificial neural network]]s) or yes/no (such as [[support vector machine]]s). Note that probabilistic binary classification models structurally resemble regression models.
|-
| discrete variable prediction problem, or multi-class classification problem || Here, the output can take one of a finite number of values. We can also think of this as a classification problem where each input case needs to be sorted into one of finitely many classes.
|-
| regression problem, or continuous variable prediction problem || Here, the output can take a value over a continuous range of values.
|}

===Classification based on the stage of learning===

{| class="sortable" border="1"
! Stage of learning !! Description
|-
| [[eager learning]] || This is the more common form of learning. Here, the training data is pre-processed and an explicit and compact representation of the function is learned from it (usually, in the form of a paramater vector). The learning phase that uses the training data to learn the compact representation takes substantially greater time, but once this phase is done, the training data can be thrown away. The compact representation takes much less memory than the training data and can be used to quickly compute the output for any input.
|-
| [[lazy learning]] || With this form of learning, the training data is not completely pre-processed. Every time a new prediction needs to be made, the training data is used to make the prediction. Lazy learning algorithms are useful in cases where the prediction of function values is based on nearby values.
|-
| [[online learning]] || Here, the input data is streamed to the algorithm, one instance at a time. For the one-instance case (each data point) three steps occur: (a) The algorithm reads the data (the input only), (b) the algorithm makes a prediction of the output, (c) the algorithm learns the actual output and updates the parameters according to that.
|}

==References==

<references/>

User:Vipul/Sandbox

2017-11-27T14:59:43Z

Vipul: Created page with "<math>e^{\pi^3 + \sqrt{3}}</math>"

Sandbox

2017-09-10T20:15:09Z

Vipul:

Logistic regression

2017-09-10T20:13:37Z

Vipul:

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

=== Bias feature and a single binary feature ===

Consider a logistic regression model with a bias feature and a single binary feature (that can be either 0 or 1). Assume we have enough training data, and the binary feature is both zero and nonzero on enough examples.

There are two ways of operationalizing this. One is to train it as a two-feature model, with one bias feature and the single binary feature. Another is to train it as a ''three''-feature model, with one bias feature, the binary feature, and its complement. The latter approach has a linear relation between features (the binary feature and its complement add up to 1). However, it allows for a more clean-to-interpret model.

Assuming no regularization:

* With the two-feature model, the model weight learned on the bias feature is the log-odds of the probability of the event occurring if the single binary feature is off, and the model weight learned on the binary feature is the correction to the log-odds caused by the binary feature being true.
* With the three-feature model, the model weight learned on the bias feature is the log-odds of the overall probability, independent of whether the binary feature is true or false. The other two model weights give the respective corrections to the log-odds probability from the feature being true and false. These two model weights are of opposite sign, and they can be deduced from one another (but they are not literally negatives of each other, because (a) the feature may not be true and false the same amount of time, so there is a skew, and (b) these are additive corrections on log-odds not on probability itself, so the linearity is not preserved). In particular, the magnitude of the weight should generally be higher for the rarer of the two cases(since this gives more unique information, and is therefore expected to cause a larger update) but that is not always true.

In particular, if knowledge of the binary feature does not change our probability estimate, then the weight learned on the feature and/or its complement is zero.

==Unique property of the logistic link function ==

The logistic function is not the only possible choice of link function that can be used to apply generalized linear models to probabilistic binary classification; another choice is the normal CDF, used in [[probit regression]].

However, the logistic link function is the only link function <math>g</math> that satisfies the property of taking the value 1/2 at 0 (which is necessary for the symmetric sigmoidal shape we seek) ''and'' satisfies the condition that, if the true probability is <math>q</math> and the linear combination in question is <math>w</math>, then the derivative of the logarithmic cost function is <math>g(w) - q</math>, and therefore the second derivative is <math>g(w)(1 - g(w))</math> which is positive and bounded by <math>1/4</math>. This is because of the [[calculus:logistic function#differential equation|differential equation]] it satisfies. This provides an easy proof of convexity as well as bounded second derivative, and shows that [[calculus:gradient descent with constant learning rate for a convex function of multiple variables|gradient descent]] can be applied.

==Relation with other forms of machine learning==

===Probit regression===

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s used for probability prediction for binary classification || Both fit the description; the link function for logistic regression is the logistic function and the link function for probit regression is the normal CDF. || The logistic link function is the unique function where the cost grows quadratically with distance from the true probability, or equivalently, the marginal cost is linear in distance from the true probability.
|-
| Use of [[calculus:gradient descent with constant learning rate for a convex function of multiple variables|gradient descent]] || Logistic regression is a convex optimization problem with a globally bounded second derivative, therefore gradient descent works. What about probit? || ??
|}

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T20:13:15Z

Vipul: /* Unique property of the logistic link function */

Logistic regression

2017-09-10T20:00:11Z

Vipul: /* Probit regression */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

=== Bias feature and a single binary feature ===

Consider a logistic regression model with a bias feature and a single binary feature (that can be either 0 or 1). Assume we have enough training data, and the binary feature is both zero and nonzero on enough examples.

There are two ways of operationalizing this. One is to train it as a two-feature model, with one bias feature and the single binary feature. Another is to train it as a ''three''-feature model, with one bias feature, the binary feature, and its complement. The latter approach has a linear relation between features (the binary feature and its complement add up to 1). However, it allows for a more clean-to-interpret model.

Assuming no regularization:

* With the two-feature model, the model weight learned on the bias feature is the log-odds of the probability of the event occurring if the single binary feature is off, and the model weight learned on the binary feature is the correction to the log-odds caused by the binary feature being true.
* With the three-feature model, the model weight learned on the bias feature is the log-odds of the overall probability, independent of whether the binary feature is true or false. The other two model weights give the respective corrections to the log-odds probability from the feature being true and false. These two model weights are of opposite sign, and they can be deduced from one another (but they are not literally negatives of each other, because (a) the feature may not be true and false the same amount of time, so there is a skew, and (b) these are additive corrections on log-odds not on probability itself, so the linearity is not preserved). In particular, the magnitude of the weight should generally be higher for the rarer of the two cases(since this gives more unique information, and is therefore expected to cause a larger update) but that is not always true.

In particular, if knowledge of the binary feature does not change our probability estimate, then the weight learned on the feature and/or its complement is zero.

==Unique property of the logistic link function ==

The logistic function is not the only possible choice of link function that can be used to apply generalized linear models to probabilistic binary classification; another choice is the normal CDF, used in [[probit regression]].

However, the logistic link function is the only link function <math>g</math> that satisfies the property of taking the value 1/2 at 0 (which is necessary for the symmetric sigmoidal shape we seek) ''and'' satisfies the condition that, if the true probability is <math>q</math> and the linear combination in question is <math>w</math>, then the derivative of the logarithmic cost function is <math>g(w) - q</math>. Verbally, this is saying that the cost grows quadratically with the distance from the true probability. This is because of the [[calculus:logistic function#differential equation|differential equation]] it satisfies.

==Relation with other forms of machine learning==

===Probit regression===

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s used for probability prediction for binary classification || Both fit the description; the link function for logistic regression is the logistic function and the link function for probit regression is the normal CDF. || The logistic link function is the unique function where the cost grows quadratically with distance from the true probability, or equivalently, the marginal cost is linear in distance from the true probability.
|-
| Use of [[calculus:gradient descent with constant learning rate for a convex function of multiple variables|gradient descent]] || Logistic regression is a convex optimization problem with a globally bounded second derivative, therefore gradient descent works. What about probit? || ??
|}

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T19:36:55Z

Vipul: /* Relation with other forms of machine learning */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

=== Bias feature and a single binary feature ===

Consider a logistic regression model with a bias feature and a single binary feature (that can be either 0 or 1). Assume we have enough training data, and the binary feature is both zero and nonzero on enough examples.

There are two ways of operationalizing this. One is to train it as a two-feature model, with one bias feature and the single binary feature. Another is to train it as a ''three''-feature model, with one bias feature, the binary feature, and its complement. The latter approach has a linear relation between features (the binary feature and its complement add up to 1). However, it allows for a more clean-to-interpret model.

Assuming no regularization:

* With the two-feature model, the model weight learned on the bias feature is the log-odds of the probability of the event occurring if the single binary feature is off, and the model weight learned on the binary feature is the correction to the log-odds caused by the binary feature being true.
* With the three-feature model, the model weight learned on the bias feature is the log-odds of the overall probability, independent of whether the binary feature is true or false. The other two model weights give the respective corrections to the log-odds probability from the feature being true and false. These two model weights are of opposite sign, and they can be deduced from one another (but they are not literally negatives of each other, because (a) the feature may not be true and false the same amount of time, so there is a skew, and (b) these are additive corrections on log-odds not on probability itself, so the linearity is not preserved). In particular, the magnitude of the weight should generally be higher for the rarer of the two cases(since this gives more unique information, and is therefore expected to cause a larger update) but that is not always true.

In particular, if knowledge of the binary feature does not change our probability estimate, then the weight learned on the feature and/or its complement is zero.

==Unique property of the logistic link function ==

The logistic function is not the only possible choice of link function that can be used to apply generalized linear models to probabilistic binary classification; another choice is the normal CDF, used in [[probit regression]].

However, the logistic link function is the only link function <math>g</math> that satisfies the property of taking the value 1/2 at 0 (which is necessary for the symmetric sigmoidal shape we seek) ''and'' satisfies the condition that, if the true probability is <math>q</math> and the linear combination in question is <math>w</math>, then the derivative of the logarithmic cost function is <math>g(w) - q</math>. Verbally, this is saying that the cost grows quadratically with the distance from the true probability. This is because of the [[calculus:logistic function#differential equation|differential equation]] it satisfies.

==Relation with other forms of machine learning==

===Probit regression===

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s used for probability prediction for binary classification || Both fit the description; the link function for logistic regression is the logistic function and the link function for probit regression is the normal CDF. || The logistic link function is the unique function where the cost grows quadratically with distance from the true probability, or equivalently, the marginal cost is linear in distance from the true probability.
|}

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T19:34:50Z

Vipul: /* Relation with other forms of machine learning */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

=== Bias feature and a single binary feature ===

Consider a logistic regression model with a bias feature and a single binary feature (that can be either 0 or 1). Assume we have enough training data, and the binary feature is both zero and nonzero on enough examples.

There are two ways of operationalizing this. One is to train it as a two-feature model, with one bias feature and the single binary feature. Another is to train it as a ''three''-feature model, with one bias feature, the binary feature, and its complement. The latter approach has a linear relation between features (the binary feature and its complement add up to 1). However, it allows for a more clean-to-interpret model.

Assuming no regularization:

* With the two-feature model, the model weight learned on the bias feature is the log-odds of the probability of the event occurring if the single binary feature is off, and the model weight learned on the binary feature is the correction to the log-odds caused by the binary feature being true.
* With the three-feature model, the model weight learned on the bias feature is the log-odds of the overall probability, independent of whether the binary feature is true or false. The other two model weights give the respective corrections to the log-odds probability from the feature being true and false. These two model weights are of opposite sign, and they can be deduced from one another (but they are not literally negatives of each other, because (a) the feature may not be true and false the same amount of time, so there is a skew, and (b) these are additive corrections on log-odds not on probability itself, so the linearity is not preserved). In particular, the magnitude of the weight should generally be higher for the rarer of the two cases(since this gives more unique information, and is therefore expected to cause a larger update) but that is not always true.

In particular, if knowledge of the binary feature does not change our probability estimate, then the weight learned on the feature and/or its complement is zero.

==Unique property of the logistic link function ==

The logistic function is not the only possible choice of link function that can be used to apply generalized linear models to probabilistic binary classification; another choice is the normal CDF, used in [[probit regression]].

However, the logistic link function is the only link function <math>g</math> that satisfies the property of taking the value 1/2 at 0 (which is necessary for the symmetric sigmoidal shape we seek) ''and'' satisfies the condition that, if the true probability is <math>q</math> and the linear combination in question is <math>w</math>, then the derivative of the logarithmic cost function is <math>g(w) - q</math>. Verbally, this is saying that the cost grows quadratically with the distance from the true probability. This is because of the [[calculus:logistic function#differential equation|differential equation]] it satisfies.

==Relation with other forms of machine learning==

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T19:09:36Z

Vipul: /* Single-feature bias or intercept model */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

=== Bias feature and a single binary feature ===

Consider a logistic regression model with a bias feature and a single binary feature (that can be either 0 or 1). Assume we have enough training data, and the binary feature is both zero and nonzero on enough examples.

There are two ways of operationalizing this. One is to train it as a two-feature model, with one bias feature and the single binary feature. Another is to train it as a ''three''-feature model, with one bias feature, the binary feature, and its complement. The latter approach has a linear relation between features (the binary feature and its complement add up to 1). However, it allows for a more clean-to-interpret model.

Assuming no regularization:

* With the two-feature model, the model weight learned on the bias feature is the log-odds of the probability of the event occurring if the single binary feature is off, and the model weight learned on the binary feature is the correction to the log-odds caused by the binary feature being true.
* With the three-feature model, the model weight learned on the bias feature is the log-odds of the overall probability, independent of whether the binary feature is true or false. The other two model weights give the respective corrections to the log-odds probability from the feature being true and false. These two model weights are of opposite sign, and they can be deduced from one another (but they are not literally negatives of each other, because (a) the feature may not be true and false the same amount of time, so there is a skew, and (b) these are additive corrections on log-odds not on probability itself, so the linearity is not preserved). In particular, the magnitude of the weight should generally be higher for the rarer of the two cases(since this gives more unique information, and is therefore expected to cause a larger update) but that is not always true.

In particular, if knowledge of the binary feature does not change our probability estimate, then the weight learned on the feature and/or its complement is zero.

==Relation with other forms of machine learning==

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T18:58:45Z

Vipul: /* Single-feature bias or intercept model */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term (after normalizing by the number of examples); the larger the regularization terms, the closer <math>w</math> is to zero. In particular, the greater the number of examples, the closer we get to <math>g^{-1}(q)</math>, holding the regularization term constant. This makes sense from the perspective of regularization as Bayesian prior: with small amounts of data we gravitate toward the Bayesian prior of even odds, whereas with a large amount of data we gravitate toward the frequency seen in the data.

Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, and a total of <math>m</math> examples, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \frac{\lambda_1}{m} |w| + \frac{\lambda_2}{m} w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \frac{\lambda_1}{m} \operatorname{sgn}(w) + \frac{2\lambda_2}{m} w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

==Relation with other forms of machine learning==

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T18:55:11Z

Vipul: /* Empty feature set and empty model */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[calculus:logistic function|logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term. Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \lambda_1 |w| + \lambda_2 w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \lambda_1 \operatorname{sgn}(w) + 2\lambda_2 w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

==Relation with other forms of machine learning==

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T18:54:07Z

Vipul: /* Computational format */

==Summary==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Binary (yes/no)
|-
| Format of prediction || Probabilistic. Rather than simply returning a binary answer, the prediction gives the respective probabilities of the two answers.
|-
| Functional form of model || Computes the probability by applying the [[calculus:logistic function|logistic function]] to a linear combination of the features. The coefficients used in the linear combination are the unknown parameters that need to be determined by the learning algorithm. It is an example of a [[generalized linear model]]. The parameters are sometimes called the ''model weights'', with each model weight associated to a particular feature for which it is the coefficient. The feature for which the model weight is zero can be thought of as not being part of the model, since its value plays no role in the prediction. We sometimes say that the features with nonzero model weights are features "picked" by the training.
|-
| Typical cost function || As with most probabilistic binary prediction models, logistic regression models are typically scored using the [[calculus:logarithmic scoring|logarithmic cost function]]. However, they could in principle be scored using the squared error cost function. Note that this still wouldn't be least-squares regression, because the least-squares is being computed ''after'' applying the logistic function.
|-
| Typical regularization choices || Both <math>L^1</math>- and <math>L^2</math>-regularization, as well as combined regularization using <math>L^1</math> and <math>L^2</math> terms, are common.
|-
| Learning algorithms || See [http://www.cs.iastate.edu/~honavar/minka-logreg.pdf here for more] (to eventually fill in here).
|}

==Definition==

The term '''logistic regression''' is used for a model as well as the act of finding the parameters of the model whose goal is to predict binary outputs. It is therefore better viewed as solving a [[classification]] problem than a regression problem. However, because the model shares many basic components with [[linear regression]], and is an example of a [[generalized linear model]], it has historically gone by the name of logistic ''regression''.

The logistic regression problem attempts to predict a binary output (yes/no) based on a set of inputs (called [[feature]]s). Rather than just predicting a yes/no answer, the logistic regression problem predicts a probability of yes. This is a number in <math>[0,1]</math>. By using a threshold probability (such as 0.5, or another value depending on what sorts of risks we want to avoid) this can make a yes/no prediction.

The probability is computed as follows:

Probability = [[calculus:logistic function|logistic function]] evaluated at (linear combination of features with initially unknown parameters)

The logistic function is the function:

<math>g(x) = \frac{1}{1 + e^{-x}}</math>

The values of the unknown parameters are determined empirically so as to best fit the [[training set]].

===Cost function used===

The typical cost function used is the logarithmic cost function (also known as [[calculus:logarithmic scoring rule|logarithmic scoring]]): This assigns a score of <math>-\log p</math> if the event happened and a score of <math>-\log (1 - p)</math> if the event did not happen. The lower the score, the better. The [[calculus:logarithmic scoring rule is proper|logarithmic scoring rule is proper]]: if the true probability is <math>q</math>, then the score is minimized by predicting <math>p = q</math>.

Note that if we could predict whether or not the event will happen with perfect confidence, the logarithmic score would evaluate to 0.

The logarithmic cost function is computed for each of the predictions made by the logistic regression model. We then average the values of the cost functions across all instances to obtain the logarithmic cost function for the specific choice of parameter values on the specific data set.

There are two standard choices of labels for describing whether the event did or did not occur. One choice is to assign a label of 0 if the event did not occur and 1 if the event occurred. Another choice is to assign a label of -1 if the event did not occur and 1 if the event occurred.

====Closed form expression for cost function using 0,1-encoding====

Suppose we assign a label <math>y</math> with value 0 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-(y \log p + (1 - y)\log(1 - p))</math>

Suppose there are <math>m</math> data points. The probability vector is the vector <math>\vec{y} = (y_1,y_2,\dots,y_m)</math> and the probability vector is the vector <math>\vec{p} = (p_1,p_2,\dots,p_m)</math>. The cost function is:

<math>\frac{1}{m} \left[\sum_{i=1}^m -(y_i \log p_i + (1 - y_i)\log(1 - p_i))\right]</math>

====Closed form expression for cost function using -1,1-encoding====

Suppose we assign a label <math>l</math> with value -1 if the event did not occur and 1 if the event occurred. Then, if <math>p</math> is the predicted probability, the score associated with <math>p</math> is:

<math>-\frac{1}{2} ((1 + l) \log p + (1 - l)\log(1 - p))</math>

===Description as a generalized linear model===

The logistic regression model can be viewed as a special case of the [[generalized linear model]], namely a case where the link function is the [[calculus:logistic function|logistic function]] and where the cost function is the logarithmic cost function.

The inverse of the logistic function is the log-odds function, and applying it to the probability gives the log-odds (logarithm of odds). Explicitly, we have:

<math>g^{-1}(p) = \ln \left( \frac{p}{1 - p}\right)</math>

Therefore, the logistic regression problem can be viewed as a linear regression problem:

Log-odds function = Linear combination of features with unknown parameters

''However'', the cost function now changes as well: we now need to apply the logistic function and then do logarithmic scoring to compute the cost function.

==Computational format==

The computational format for a logistic regression is as follows. Note that there may be variations in terms of the roles of rows and columns. We follow the convention of using column vectors and having the matrix multiplied on the left of the vector.

Some notation:

* <math>m</math> denotes the number of examples (data points).
* <math>n</math> denotes the number of features, or equivalently, the number of parameters. Note that the number of elementary features need not equal <math>n</math>. The "features" we are referring to are expressions in the elementary features that we can use as the spanning set for our arbitrary linear combinations whose coefficients are the unknown parameters we need to find.
* <math>X</math> is the data matrix or design matrix of the regression. <math>X</math> is a <math>m \times n</math> matrix. Each row of <math>X</math> corresponds to one example. Each column of <math>X</math> corresponds to one feature (not necessarily an elementary feature) and hence also to one coordinate of the parameter vector (the coefficient on that feature). The entry in a given row and given column is the feature value for that example.
* The vector of labels (or actual outputs) is a <math>m</math>-dimensional vector. If we use the 0-1 convention, this is a vector all of whose coordinates are either 0 or 1. If we use the <math>\{ -1,1 \}</math>-convention, this is a vector all of whose coordinates are either -1 or 1. For convenience on this page, we'll denote the former vector by <math>\vec{y}</math> and the latter by <math>\vec{l}</math>. We have the relations <math>y_i = (1 + l_i)/2</math>, and <math>l_i = 2y_i - 1</math>, for all <math>i</math>.
* The parameter vector is a <math>n</math>-dimensional vector. We will denote it as <math>\vec{\theta}</math>.

The predicted probability vector is given as:

<math>\vec{p} = g(X \vec{\theta})</math>

where <math>g</math> is the [[calculus:logistic function|logistic function]] and is applied coordinate-wise.

== Examples of feature sets and models ==

=== Empty feature set and empty model ===

The "empty model" for a logistic regression problem is the model with no features, or alternatively, the model where all the features have zero model weights. The linear combination generated for any example is zero, so the probability predicted for any example is the [[logistic function]] applied at 0, which is <math>g(0) = 1/2</math>.

The logarithmic cost function for each example is therefore <math>-\log(1/2) = \log 2 \approx 0.6931</math>, and hence, so is the arithmetic mean. This is treated as a baseline for the logarithmic loss on logistic regression models and for any binary classification models predicting a probability; any logistic regression model that is trained properly should provide a lower (better) cost than the empty model.

Standard choices of regularization such as <math>L^1</math> ([[lasso]]), <math>L^2</math> ([[ridge regression]]), and elastic net (a mix of the two) all give their lowest penalty of zero to the empty model. Therefore, in general, if the event occurs about half the time and none of the features being trained on have any signal useful for predicting the outcome, a regularized logistic regression will converge to the empty model (however, note the caveat on the bias term below).

=== Single-feature bias or intercept model ===

This is a model with a single nonzero model weight, corresponding to a feature that is 1 on all examples. If this weight is <math>w</math>, then the linear combination is <math>w</math> on each example, and the predicted probability works out to <math>g(w)</math>.

If this model is trained without regularization on a training set, then the weight <math>w</math> that it learns is <math>g^{-1}(q)</math>, where <math>q</math> is the fraction of positive examples in the training set. If the training set is sufficiently large and representative, then <math>q</math> equals the probability of occurrence of the event, so the model essentially predicts the probability of occurrence (the base rate) without trying to figure out which examples are more or less likely to be positive.

In the case of regularization such as <math>L^1</math>, <math>L^2</math>, or elastic net, the weight <math>w</math> learned is between <math>0</math> and <math>g^{-1}(q)</math>, where its exact position is determined by the strength of the regularization term. Specifically, for the case of elastic net regularization with <math>L^1</math>-parameter <math>\lambda_1</math> and <math>L^2</math>-parameter <math>\lambda_2</math>, we have to pick <math>w</math> that minimizes:

<math>-q \log(g(w)) - (1 - q) \log(1 - g(w)) + \lambda_1 |w| + \lambda_2 w^2</math>

Taking derivatives and finding critical points, we get:

<math>g(w) - q + \lambda_1 \operatorname{sgn}(w) + 2\lambda_2 w = 0</math>

There is no analytical solution to this but it can be solved using numerical techniques.

==Relation with other forms of machine learning==

===Linear regression===

Logistic regression and [[linear regression]] are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| [[Generalized linear model]]s, so linear dependence on inputs || Both are examples of generalized linear models || For linear regression, the link function is the identity function and the typical choice of cost function is the squared error cost function. In the case of logistic regression, the link function is the [[calculus:logistic function|logistic function]] and the typical choice of cost function is the logarithmic cost function.
|-
| Prediction of continuous variables || ''Prima facie'', both of them output variables that take continuous values || Linear regression outputs a continuous variable that is the estimate of the output being predicted. The continuous variable output by logistic regression is the ''probability'' associated with a binary classification problem.
|}

===Support vector machines===

Logistic regression and the [[support vector machine]] (SVM) regression method are related in the following ways:

{| class="sortable" border="1"
! Aspect !! How they're similar !! How they're different
|-
| Binary classification || Both logistic regression and support vector model are approaches to tackling binary classification. || Logistic regression outputs a probability, whereas support vector models output a yes/no answer. Support vector machines ''can'' be construed as giving an output describing the confidence of a classification, but this is not explicitly translated into a probability. Note that the linear SVM result can be interpreted as a result for the logistic regression problem, and running linear SVM and logistic regression on the same data set can yield very similar results.
|}

===Artificial neural networks===

Artificial neural networks are a more complicated type of machine learning setup that is capable of learning more complex functions. The individual units in an artificial neural network, called [[artificial neuron]]s, can in principle be chosen to be any functions, but the typical choice is to choose each of them as a logistic regression model. In other words, the output of each artificial neuron is obtained by computing the logistic function of a linear combination (via an unknown parameter vector) of the inputs.

===Maximum entropy (MaxEnt) models===

Maximum entropy models generalize logistic regression to particular types of classification problems where the relative probabilities of the discrete classes satisfy a particular kind of mathematical relationship (the need for a constraint on the relationship arises only when there are three or more different possibilities; no assumptions are necessary in the binary case).

Logistic regression

2017-09-10T18:26:49Z

Vipul:

Comparison of lasso and ridge regularization

2017-09-10T17:52:49Z

Vipul:

== Case of generalized linear model (such as linear regression or logistic regression) ==

{| class="sortable" border="1"
! Case on features !! Result under lasso !! Result under ridge
|-
| Two copies of the same feature. If only one copy had been included, a parameter value of <math>w</math> would have been learned. || The weight <math>w</math> gets split across the features, but in an indeterminate way. In other words, you could get any distribution <math>\alpha w, (1 - \alpha) w</math>. || The weight <math>w</math> gets evenly split across the features: <math>w/2</math> each.
|-
| One primary feature, that for simplicity we take as a binary feature that is nonzero on some fraction of examples. Two other "backup" features, that are each nonzero in disjoint halves of the case the primary feature is nonzero. Once the primary feature is known, there is no additional signal in knowing the values of the backup features. If the backup features were excluded, the primary feature would learn a weight of <math>w</math>. || The weight <math>w</math> all goes on the primary features, and the backup features get weights of zero. || The primary features gets weights <math>(2/3)w</math> and each of the backup features get weight <math>(1/3)w</math>.
|}

Logistic regression

2017-09-10T17:46:42Z

Vipul: /* Cost function used */

Cost function

2017-09-10T16:44:57Z

Vipul: /* Binary classification problems */

The '''cost function''' associated with a given machine learning problem is a function that takes as input the predicted function value and actual observed output and associates to them a number measuring how far the predicted value is from the observed value.

==Definition==

This section gives the definitions of cost functions for a single piece of data, for both regression problems and classification problems.

===Regression problems===

For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function <math>C\colon \mathbb R \times \mathbb R \to \mathbb R</math> of two variables <math>u,v \in \mathbb R</math> (the predicted value and actual value) satisfying the following conditions:

* <math>C(u,u) = 0</math> for all <math>u \in \R</math>
* For <math>u \le v \le w</math>, we have <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math>

The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.

Two typical examples of cost functions are <math>C(u,v) = (u-v)^2</math> (the [[quadratic cost function]], used for least squares linear regression) and <math>C(u,v) = |u-v|</math> (used for least absolute deviations regression). With the former cost function, we see that for <math>-1,0,1 \in \mathbb R</math>, we have <math>(-1-0)^2 + (0-1)^2 = 2 \not\geq 4 = (-1-1)^2</math>, so the triangle inequality is not satisfied.

===Binary classification problems===

For binary classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function <math>C\colon [0,1]\times \{0,1\} \to \mathbb R</math> of two variables <math>p\in [0,1]</math> and <math>v \in \{0,1\}</math> (the predicted probability and actual value) satisfying the following conditions:

* <math>C(1,1) = 0</math>
* <math>C(0,0) = 0</math>
* For <math>p \le q</math>, we have <math>C(q,1) \le C(p,1)</math>
* For <math>p \le q</math>, we have <math>C(p,0) \le C(q,0)</math>

Although not a strict requirement, cost functions should be selected to be [[calculus:proper scoring rule|proper scoring rule]]s, so that they penalize accurate probabilities less than inaccurate probabilities.

Typical cost functions used for binary classification problems include the [[calculus:logarithmic scoring rule|logarithmic cost function]] and the [[quadratic cost function]], both of which are proper scoring rules. Of these, the logarithmic scoring rules is the more widely used, because it is the only [[calculus:logarithmic scoring rule is the only proper scoring rule up to affine transformations in case of more than two classes|proper scoring rule when considering more than two classes]].

==Combining the cost function values across multiple data points==

To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices:

* Arithmetic mean: This is the most common, and the default specification. This is equivalent to using the sum of the cost functions, but using the mean instead of the sum is preferable because that allows us to directly compare cost function values for data sets of different sizes.
* Mean using <math>r^{\text{th}}</math> powers, for some <math>r > 1</math>: We take the mean of the <math>r^{\text{th}}</math> powers of all the cost functions, then take the <math>r^{\text{th}}</math> root.
* Maximum value

Note that there is some flexibility in terms of how we divide the load between the choice of cost function and the choice of averaging function: for instance, there is some equivalence between using the absolute value cost function and the root mean square averaging process versus using the squared error cost function and the arithmetic mean averaging process. The cost functions obtained in both cases are equivalent under a monotone transformation. If, however, we are considering adding regularization terms, then the distinction between the cost functions matters.

Cost function

2017-09-10T16:42:29Z

Vipul: /* Regression problems */

The '''cost function''' associated with a given machine learning problem is a function that takes as input the predicted function value and actual observed output and associates to them a number measuring how far the predicted value is from the observed value.

==Definition==

This section gives the definitions of cost functions for a single piece of data, for both regression problems and classification problems.

===Regression problems===

For regression problems (prediction problems associated with continuous variables), both the predicted value and the actual value are continuous variables. The cost function is a function <math>C\colon \mathbb R \times \mathbb R \to \mathbb R</math> of two variables <math>u,v \in \mathbb R</math> (the predicted value and actual value) satisfying the following conditions:

* <math>C(u,u) = 0</math> for all <math>u \in \R</math>
* For <math>u \le v \le w</math>, we have <math>C(u,v) \le C(u,w)</math> and <math>C(v,w) \le C(u,w)</math>

The cost function need not satisfy the triangle inequality; in fact, typical cost functions penalize bigger errors superlinearly.

Two typical examples of cost functions are <math>C(u,v) = (u-v)^2</math> (the [[quadratic cost function]], used for least squares linear regression) and <math>C(u,v) = |u-v|</math> (used for least absolute deviations regression). With the former cost function, we see that for <math>-1,0,1 \in \mathbb R</math>, we have <math>(-1-0)^2 + (0-1)^2 = 2 \not\geq 4 = (-1-1)^2</math>, so the triangle inequality is not satisfied.

===Binary classification problems===

For binary classification problems (prediction problems associated with discrete variables), the predicted value is a probability and the actual value is simply a discrete value (0 or 1). The cost function is a function <math>C\colon [0,1]\times \{0,1\} \to \mathbb R</math> of two variables <math>p\in [0,1]</math> and <math>v \in \{0,1\}</math> (the predicted probability and actual value) satisfying the following conditions:

* <math>C(1,1) = 0</math>
* <math>C(0,0) = 0</math>
* For <math>p \le q</math>, we have <math>C(q,1) \le C(p,1)</math>
* For <math>p \le q</math>, we have <math>C(p,0) \le C(q,0)</math>

==Combining the cost function values across multiple data points==

To compute the cost function for several data points, we need to know the cost function for a single data point, as well as an approach for averaging the cost functions. The following are some typical choices:

* Arithmetic mean: This is the most common, and the default specification. This is equivalent to using the sum of the cost functions, but using the mean instead of the sum is preferable because that allows us to directly compare cost function values for data sets of different sizes.
* Mean using <math>r^{\text{th}}</math> powers, for some <math>r > 1</math>: We take the mean of the <math>r^{\text{th}}</math> powers of all the cost functions, then take the <math>r^{\text{th}}</math> root.
* Maximum value

Note that there is some flexibility in terms of how we divide the load between the choice of cost function and the choice of averaging function: for instance, there is some equivalence between using the absolute value cost function and the root mean square averaging process versus using the squared error cost function and the arithmetic mean averaging process. The cost functions obtained in both cases are equivalent under a monotone transformation. If, however, we are considering adding regularization terms, then the distinction between the cost functions matters.

Lasso

2017-09-10T16:25:46Z

Vipul: /* Effects of lasso */

== Definition ==

Lasso, also known as <math>L^1</math>-regularization, is a type of [[regularization]] where the regularization term is of the following form, where <math>w_1, w_2, \dots, w_n</math> are the unknown parameters of the form being trained for:

<math>\lambda \sum_{i=1}^n | w_i |</math>

In some cases, due to scaling issues, a lasso may not make direct sense, so we may need additional (predetermined) coefficients <math>\alpha_1, \alpha_2, \dots, \alpha_n</math> to rescale the weights:

<math>\lambda \sum_{i=1}^n \alpha_i |w_i|</math>

== Effects of lasso ==

=== Summary ===

{| class="sortable" border="1"
! Item !! Value
|-
| Convexity || The lasso function is convex. Thus, if the original cost function is convex, the regularized cost function is also convex. In particular, it does not destroy the ability to apply optimization methods that rely solely on convexity.
|-
| Differentiability || The lasso function is differentiable at most points except where one of the weights becomes zero (with partial derivative being undefined in the direction parallel to that weight vector). This can come in the way of iterative application of [[gradient descent]] since the gradient vector is undefined. However, setting partial derivative to zero when it is undefined generally works.
|-
| Type of model generated || Lasso regression pushes towards models where some parameters become precisely zero. It will also generally push those features to be zero that can help distinguish fewer examples, so for instance if a dense feature and a sparser feature both play a similar predictive role, lasso will tend to set the sparser feature to zero. For more, see [[comparison of lasso and ridge regularization]].
|}

Comparison of lasso and ridge regularization

2017-09-10T16:25:18Z

Vipul: Created page with "{| class="sortable" border="1" ! Case on features !! Result under lasso !! Result under ridge |- | Two copies of the same feature. If only one copy had been included, a param..."

{| class="sortable" border="1"
! Case on features !! Result under lasso !! Result under ridge
|-
| Two copies of the same feature. If only one copy had been included, a parameter value of <math>w</math> would have been learned. || The weight <math>w</math> gets split across the features, but in an indeterminate way. In other words, you could get any distribution <math>\alpha w, (1 - \alpha) w</math>. || The weight <math>w</math> gets evenly split across the features: <math>w/2</math> each.
|-
| One primary feature, that for simplicity we take as a binary feature that is nonzero on some fraction of examples. Two other "backup" features, that are each nonzero in disjoint halves of the case the primary feature is nonzero. Once the primary feature is known, there is no additional signal in knowing the values of the backup features. If the backup features were excluded, the primary feature would learn a weight of <math>w</math>. || The weight <math>w</math> all goes on the primary features, and the backup features get weights of zero. || The primary features gets weights <math>(2/3)w</math> and each of the backup features get weight <math>(1/3)w</math>.
|}

Lasso

2017-09-10T16:12:29Z

Vipul:

Lasso

2017-09-10T16:12:07Z

Vipul: Created page with "== Definition == Lasso, also known as <math>L^1</math>-regularization, is a type of regularization where the regularization term is of the following form, where <math>w_1..."

== Definition ==

Lasso, also known as <math>L^1</math>-regularization, is a type of [[regularization]] where the regularization term is of the following form, where <math>w_1, w_2, \dots, w_n</math> are the unknown parameters of the form being trained for:

<math>\lambda \sum_{i=1}^n | w_i |</math>

In some cases, due to scaling issues, a lasso may not make direct sense, so we may need additional (predetermined) coefficients <math>\alpha_1, \alpha_@, \dots, \alpha_n</math> to rescale the weights:

<math>\lambda \sum_{i=1}^n \alpha_i |w_i|</math>

== Effects of lasso ==

=== Summary ===

{| class="sortable" border="1"
! Item !! Value
|-
| Convexity || The lasso function is convex. Thus, if the original cost function is convex, the regularized cost function is also convex. In particular, it does not destroy the ability to apply optimization methods that rely solely on convexity.
|-
| Differentiability || The lasso function is differentiable at most points except where one of the weights becomes zero (with partial derivative being undefined in the direction parallel to that weight vector). This can come in the way of iterative application of [[gradient descent]] since the gradient vector is undefined. However, setting partial derivative to zero when it is undefined generally works.
|-
| Type of model generated || Lasso regression pushes towards models where some parameters become precisely zero. It will also generally push those features to be zero that can help distinguish fewer examples, so for instance if a dense feature and a sparser feature both play a similar predictive role, lasso will tend to set the sparser feature to zero.
|}

L^1 regularization

2017-09-10T15:40:26Z

Vipul: Redirected page to Lasso

#redirect [[Lasso]]

Regularization

2017-09-10T15:39:32Z

Vipul:

==Definition==

'''Regularization''' is described as follows.

We have:

* A set of [[feature]]s that predict an output value
* A set of [[training data]], including the set of features and the output value
* A functional form, in terms of unknown parameters, that describes the output value in terms of the features. These parameters are also sometimes known as model weights.
* A choice of [[cost function]] (or error function) that measures the error for a given pair of predicted output and actual output.

Regularization refers to a process where we modify the cost function by adding to it an expression that captures the complexity of the feature set. The expression is typically a product of a [[hyperparameter]] (subject to smart [[hyperparameter optimization]]) and a fixed function of the parameters being learned (i.e., the model weights) chosen based on the problem domain. This fixed function is typically the <math>L^1</math>-norm, squared <math>L^2</math>-norm, or <math>L^\infty</math>-norm of the feature vector (sometimes with some coordinates removed).

Note that the choice of regularization, including the choice of hyperparameter, need to be known by the [[learning algorithm]]. Moreover, some learning algorithms that work for unregularized problems may not work for regularized problems, or may need to be modified to tackle the version with regularization.

Regularization is used ''only on the training data'', not on test data that was withheld from the learning algorithm.

===Goal: enforcing simplicity and reducing complexity===

Regularization introduces a penalty for complexity, and forces the parameter vector to be simple. This reduces the extent of [[overfitting]].

Regularization can also enforce unique solutions in the case of overdetermined problems.

==Hyperparameter optimization for the regularization hyperparameter==

{{further|[[hyperparameter optimization]]}}

Ideally, we would like to choose a [[regularization hyperparameter]] such that the parameters found by the model do best on new data that was withheld from the learning algorithm. The approach used for this is [[cross-validation]]: we cordon off a part of the training set from the learning algorithm (this cordoned-off part is called the cross-validation set), run the learning algorithm for different choices of hyperparameter, and compare the performance of all the solutions obtained on the cross-validation set. We pick the one that does best on this set and then check that it did well on the training set.

Some algorithms use the test set as their cross-validation set. For sufficiently large data sets, this is not a problem. However, for small and intermediate-sized data sets, this is problematic because we end up overfitting the regularization parameter itself by exposing it to influence from the test set. The neatest approach is to keep the cross-validation and test sets separate.

Linear regression

2017-09-10T15:37:05Z

Vipul: Created page with "== Summary == {| class="sortable" border="1" ! Item !! Value |- | Type of variable predicted || Continuous; however this could also be used to predict discrete variables that..."

== Summary ==

{| class="sortable" border="1"
! Item !! Value
|-
| Type of variable predicted || Continuous; however this could also be used to predict discrete variables that can be placed on a continuum, such as values that are constrained to be integers.
|-
| Format of prediction || A point estimate of the value is output
|-
| Functional form of model || Computes the point estimate for the value being predicted by taking a linear combination of the features. The coefficients for the linear combination are the unknown parameters (also known as ''model weights'') that need to be determined by the learning algorithm.
|-
| Typical cost function || The most typical is ordinary least squares (OLS) regression, where the loss associated with each prediction is the square of the distance between the prediction and the actual value. There are many variants, such as weighted least squares (where different predictions get different weights), total least squares (where errors in both dependent and independent variables are modeled), non-negative least squares (where the parameters are all constrained to be non-negative).
|-
| Typical regularization choices || <math>L^1</math> (lasso), <math>L^2</math> (ridge), and a mix (elastic net). Other Bayesian priors may also be used to generate regularizations.
|}

Logistic regression

2017-09-10T15:24:51Z

Vipul: /* Summary */

Logistic regression

2017-09-10T15:19:05Z

Vipul: /* Support vector machines */