Backpropagation
Backpropagation has several related meanings when talking about neural networks. Here we mean the process of finding the gradient of a cost function.
Terminology confusion
- Some people use "backpropagation" to mean the process of finding the gradient of a cost function. This leads people to say things like "backpropagation is just the multivariable chain rule".
- Some people use "backpropagation" or maybe "backpropagation algorithm" to mean the entire gradient descent algorithm, where the gradient is computed using the multivariable chain rule.
Table of notation
Name | Symbols | Type | Explanation |
---|---|---|---|
The weight that takes the th neuron in layer to the th neuron in layer | Real number | The and seem reversed because this makes more sense in terms of matrix multiplication. The arrow doesn't need to be there and has no deep meaning; it's just a reminder of which neuron goes to which. |
The cost function
The cost function is defined as:
Here are the components of the actual value given input , and are the activations in the final (output) layer.
What is the type of ? In other words, what are the domain and range of ? The cost should be a real number, so the range is , the set of real numbers. As for the domain, we want to take in all the weights and biases of the network. In each layer, the weights can be thought of as a matrix , which has some number of rows and columns. The rows and columns of the weight matrices don't have to be the same across layers (except to make the dimensions check out when multiplying and adding adjacent layers). But what dimension does a list of matrices, each with potentially different dimensions, have? We could somehow take the largest row count and column count, then multiply those together and then by . But then there would be a lot of "blanks" where the dimensions are smaller than this maximum. This approach actually seems very clumsy. The right way to think about the domain is to notice that while the weights tend to be packaged together into matrices for each layer, we can easily "unroll" the weights into a single list. Then we can just count the number of weights (which are just real numbers) in all the layers. If that number is , we have .
The point of backpropagation
The whole point of backpropagation is to find the partial derivatives which tell us how to adjust the weights of the network.
The usual neural network graph
Neural networks are usually illustrated where each node is the activation of neuron in layer . Then the weights are labeled along the edges.
[graph goes here]
Computational graphs
The multivariable chain rule can be represented as a computational graph where the nodes are variables that store values from intermediate computations. Each node can use values given to by edges going into it, and sends its output to nodes going out.
A different neural network graph
Naively, we might try to piece together the information in the previous sections as follows:
- The usual neural network graph looks like a computational graph: it's got nodes storing intermediate values that feed into later layers.
- We know computational graphs are good at representing the process of computing the gradient.
- In order to do gradient descent, we want the gradient of the cost function.
- Therefore, by (1)-(3), we can use the neural network graph to represent the process of the computing gradient.
The problem with the above argument is that the computational graph is only good at representing the process of computing the gradient when the variables we are differentiating with respect to are nodes on the graph. But in our case, the cost function is rather than , i.e. since we are tinkering with the weights of the network, we want to think of the cost as a function of the weights (parameterized by the input ) rather than as a function of the input (parameterized by the weights ).
But the usual neural network graph only has nodes for the activations , so how can we compute the relevant derivatives? One solution is to perform surgery on the existing graph to add the weights as nodes.
[graph goes here]
Now that the weights are on individual nodes, we can think of the activation nodes as receiving all the weights and all the previous activations, then doing the multiplication and passing it through the sigmoid function. This is in contrast to before, where the muliplication was "done along the edges" so that the activation nodes received the products and only did the addition and sigmoid.