Backpropagation
Backpropagation has several related meanings when talking about neural networks. Here we mean the process of finding the gradient of a cost function.
Terminology confusion
- Some people use "backpropagation" to mean the process of finding the gradient of a cost function. This leads people to say things like "backpropagation is just the multivariable chain rule".
- Some people use "backpropagation" or maybe "backpropagation algorithm" to mean the entire gradient descent algorithm, where the gradient is computed using the multivariable chain rule.
The usual neural network graph
Neural networks are usually illustrated as a graph where each node is the activation <math>a^l_j</math> of neuron <math>j</math> in layer <math>l</math>. Then the weights <math>w^l_{jk}</math> are labeled along the edges.
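As a reminder of what this picture encodes (the notation <math>a^l_j</math>, <math>w^l_{jk}</math>, <math>b^l_j</math> is an assumption of the standard convention, not something defined on this page), each activation is computed from the previous layer's activations along the incoming edges:

<math>a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)</math>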
Computational graphs
The multivariable chain rule can be represented as a computational graph where the nodes are variables that store values from intermediate computations. Each node can use the values given to it by its incoming edges, and sends its output along its outgoing edges.
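For instance, here is a minimal sketch (hypothetical, not taken from this page) of a computational graph for <math>f(x,y) = (x+y)\,x</math>: the forward pass stores the intermediate node values, and the backward pass applies the multivariable chain rule along the edges in reverse, summing over the two paths from <math>x</math> to <math>f</math>.

<syntaxhighlight lang="python">
# Toy computational graph for f(x, y) = (x + y) * x (illustrative sketch).

def forward(x, y):
    # Forward pass: each intermediate value is a node in the graph.
    s = x + y      # node s, with incoming edges from x and y
    f = s * x      # node f, with incoming edges from s and x
    return f, s

def backward(x, y, s):
    # Backward pass: propagate df/d(node) from the output back along the
    # edges, applying the multivariable chain rule at each node.
    df_df = 1.0
    df_ds = df_df * x                    # f = s * x  =>  df/ds = x
    df_dx_direct = df_df * s             # direct edge x -> f
    df_dx = df_dx_direct + df_ds * 1.0   # x also feeds s, so sum both paths
    df_dy = df_ds * 1.0                  # s = x + y  =>  ds/dy = 1
    return df_dx, df_dy

x, y = 3.0, 2.0
f, s = forward(x, y)
print(f, backward(x, y, s))   # f = 15.0, df/dx = 8.0, df/dy = 3.0
</syntaxhighlight>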
A different neural network graph
Naively, we might try to piece together the information in the previous sections as follows:
1. The usual neural network graph looks like a computational graph: it's got nodes storing intermediate values that feed into later layers.
2. We know computational graphs are good at representing the process of computing the gradient.
3. In order to do gradient descent, we want the gradient of the cost function.
4. Therefore, by (1)-(3), we can use the neural network graph to represent the process of computing the gradient.
The problem with the above argument is that the computational graph is only good at representing the process of computing the gradient ''when the variables we are differentiating with respect to are nodes on the graph''. But in our case, the cost function is <math>C(W,b;x)</math> rather than <math>C(x;W,b)</math>, i.e. since we are tinkering with the weights of the network, we want to think of the cost as a function of the weights (parameterized by the input <math>x</math>) rather than as a function of the input <math>x</math> (parameterized by the weights <math>W</math>).
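To make the distinction concrete, here is a small hypothetical sketch for a single sigmoid neuron with squared-error cost <math>C = \tfrac{1}{2}(a - y)^2</math> (the neuron, cost, and variable names are illustrative assumptions, not taken from this page). The backward pass is the same chain-rule computation either way; what changes is which derivatives we keep: gradient descent needs <math>\partial C/\partial w</math> and <math>\partial C/\partial b</math>, not <math>\partial C/\partial x</math>.

<syntaxhighlight lang="python">
import math

# A single sigmoid neuron with squared-error cost, treated two ways:
#   as C(W,b;x) -- gradient taken with respect to the parameters w, b
#   as C(x;W,b) -- gradient taken with respect to the input x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_grads(w, b, x, y):
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    C = 0.5 * (a - y) ** 2
    # Backward pass: shared chain-rule factor dC/dz
    dC_dz = (a - y) * a * (1.0 - a)     # sigma'(z) = a * (1 - a)
    grads_params = (dC_dz * x, dC_dz)   # dC/dw, dC/db -- what gradient descent needs
    grad_input = dC_dz * w              # dC/dx -- differentiating w.r.t. the input instead
    return C, grads_params, grad_input

print(cost_and_grads(w=0.5, b=-0.2, x=1.5, y=1.0))
</syntaxhighlight>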