Backpropagation derivation using Leibniz notation

Revision as of 00:26, 9 November 2018 by IssaRice

This page presents a derivation (proof) of backpropagation using Leibniz notation. Leibniz notation is the most common notation for presenting backpropagation, but it is somewhat complicated because it blurs the distinction between functions and values and leaves functional relationships implicit. Those who prefer function notation may wish to refer to backpropagation derivation using function notation instead of (or in addition to) this page.

Most of the notation on this page is borrowed from Michael Nielsen's book.[1]

Theorem statement

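Let $N$ be a neural network with $L$ layers and let $n(l)$ be the number of neurons in layer $l$ for $l \in \{1, \ldots, L\}$. For $l \in \{2, \ldots, L\}$, $k \in \{1, \ldots, n(l-1)\}$, and $j \in \{1, \ldots, n(l)\}$, let $w^l_{jk} \in \mathbb{R}$ be the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, and let $b^l_j \in \mathbb{R}$ be the bias of the $j$th neuron in the $l$th layer. Let $z^l_j = \sum_{k=1}^{n(l-1)} w^l_{jk} a^{l-1}_k + b^l_j$ and let $a^l_j = \sigma(z^l_j)$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is the sigmoid function. Let $C = \frac{1}{2} \sum_{j=1}^{n(L)} (y_j - a^L_j)^2$ be a cost function. Then we can calculate the partial derivatives $\frac{\partial C}{\partial w^l_{jk}}$ and $\frac{\partial C}{\partial b^l_j}$ starting from the later layers. Specifically, for $l < L$ we have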

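$$\frac{\partial C}{\partial w^l_{jk}} = \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i) w^{l+1}_{ij}\right) \sigma'(z^l_j) a^{l-1}_k$$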

and

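$$\frac{\partial C}{\partial b^l_j} = \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i) w^{l+1}_{ij}\right) \sigma'(z^l_j)$$

(The bias formula is the weight formula without the factor $a^{l-1}_k$, because $\frac{\partial z^l_j}{\partial b^l_j} = 1$ whereas $\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k$.)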
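The definitions of $z^l_j$, $a^l_j$, and $C$ in the theorem statement can be sketched as a forward pass. This is a minimal illustration in plain Python, not a reference implementation: the names `forward` and `cost` and the nested-list layout are assumptions made here, with `weights[m][j][k]` standing for $w^{m+2}_{jk}$ (list indices start at 0, while weights exist only from layer 2 onward).

```python
import math

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Compute every z^l_j and a^l_j for the input activations x.

    weights[m][j][k] is the weight from neuron k of one layer to neuron j
    of the next layer; biases[m][j] is the bias of neuron j in that next
    layer. Returns (zs, activations), where activations[0] is x itself.
    """
    activations = [list(x)]
    zs = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z_j = sum_k w_jk * a_k + b_j
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations

def cost(a_out, y):
    """C = 1/2 * sum_j (y_j - a^L_j)^2."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))
```

Storing the values of `zs` and `activations` matters because the backward recursion reuses them instead of recomputing them.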

Proof

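We induct on the layer number $l$, starting at $l = L$ and working backward. For the base case, we have

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) a^{L-1}_k$$

We also have

$$\frac{\partial C}{\partial a^L_j} = a^L_j - y_j$$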
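The base case can be checked numerically. The sketch below compares the closed-form output-layer derivative with a central finite difference; the function names and all numeric values are hypothetical, chosen only for illustration, and the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is used for the sigmoid derivative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Standard identity for the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def output_cost(W, b, a_prev, y):
    """C = 1/2 * sum_j (y_j - a_j)^2 for the output layer alone."""
    total = 0.0
    for row, b_j, y_j in zip(W, b, y):
        z_j = sum(w * a for w, a in zip(row, a_prev)) + b_j
        total += 0.5 * (y_j - sigmoid(z_j)) ** 2
    return total

def output_weight_grad(W, b, a_prev, y, j, k):
    """Base-case formula: dC/dw_jk = (a_j - y_j) * sigma'(z_j) * a_prev[k]."""
    z_j = sum(w * a for w, a in zip(W[j], a_prev)) + b[j]
    return (sigmoid(z_j) - y[j]) * sigmoid_prime(z_j) * a_prev[k]

# Hypothetical output layer: 2 neurons fed by 2 activations.
W = [[0.3, -0.2], [0.1, 0.4]]
b = [0.05, -0.1]
a_prev = [0.6, 0.9]
y = [1.0, 0.0]

def numeric_grad(j, k, eps=1e-6):
    """Central finite-difference estimate of dC/dw_jk."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[j][k] += eps
    W_minus[j][k] -= eps
    return (output_cost(W_plus, b, a_prev, y)
            - output_cost(W_minus, b, a_prev, y)) / (2 * eps)

for j in range(2):
    for k in range(2):
        print(j, k, output_weight_grad(W, b, a_prev, y, j, k), numeric_grad(j, k))
```

The two printed columns should agree up to finite-difference error for each pair of indices.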

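For a general layer $l$, the cost function $C$ depends on $w^l_{jk}$ only through the activation of the $j$th neuron in the $l$th layer, i.e. on the value of $a^l_j$. Thus we can use the chain rule to expand:

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}$$

We know that $\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j) a^{l-1}_k$ because $a^l_j = \sigma(z^l_j) = \sigma\left(\sum_{k=1}^{n(l-1)} w^l_{jk} a^{l-1}_k + b^l_j\right)$. We have used the chain rule again here.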

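In turn, for $l < L$, $C$ depends on $a^l_j$ only through the activations of the $(l+1)$th layer. Thus we can write (using the chain rule once again):

$$\frac{\partial C}{\partial a^l_j} = \sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}$$

Backpropagation works recursively starting at the later layers. Since we are trying to compute $\frac{\partial C}{\partial a^l_j}$ for the $l$th layer, we can assume inductively that we have already computed $\frac{\partial C}{\partial a^{l+1}_i}$.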

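It remains to find $\frac{\partial a^{l+1}_i}{\partial a^l_j}$. But $a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_{j=1}^{n(l)} w^{l+1}_{ij} a^l_j + b^{l+1}_i\right)$, so we have

$$\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i) w^{l+1}_{ij}$$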
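The recursive step, the sum over layer $l+1$ combined with the factor just derived, can be sketched as a single function. The name `backprop_activation_grad` and its argument layout are assumptions for illustration: `W_next[i][j]` stands for $w^{l+1}_{ij}$, `z_next[i]` for $z^{l+1}_i$, and `dC_da_next[i]` for the already-computed $\frac{\partial C}{\partial a^{l+1}_i}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_activation_grad(W_next, z_next, dC_da_next):
    """Return the list of dC/da^l_j for every neuron j in layer l.

    Implements dC/da^l_j = sum_i dC/da^{l+1}_i * sigma'(z^{l+1}_i) * w^{l+1}_{ij}.
    """
    n_next = len(W_next)     # number of neurons in layer l+1
    n_l = len(W_next[0])     # number of neurons in layer l
    return [
        sum(dC_da_next[i] * sigmoid_prime(z_next[i]) * W_next[i][j]
            for i in range(n_next))
        for j in range(n_l)
    ]

# Hypothetical example: one neuron in layer l+1, two neurons in layer l.
# With z = 0 the sigmoid derivative is 0.25, so the result is
# 4.0 * 0.25 * [2.0, -1.0].
print(backprop_activation_grad([[2.0, -1.0]], [0.0], [4.0]))
```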

Putting all this together, we obtain

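$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} = \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}\right) \sigma'(z^l_j) a^{l-1}_k = \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i) w^{l+1}_{ij}\right) \sigma'(z^l_j) a^{l-1}_k$$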

Let us verify that we can calculate the right-hand side. By the induction hypothesis, we can calculate $\frac{\partial C}{\partial a^{l+1}_i}$. We calculate $z^{l+1}_i$, $z^l_j$, and $a^{l-1}_k$ during the forward pass through the network. Finally, $w^{l+1}_{ij}$ is just a weight in the network, so we already know its value.
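Putting the base case and the recursive step together gives a complete backward pass, sketched below in plain Python and compared against finite differences. The network shape, all numeric values, and the function names are hypothetical. The bias gradient relies on $\frac{\partial z^l_j}{\partial b^l_j} = 1$, which follows from the definition of $z^l_j$; `weights[m]` holds the weights into layer $m+2$ of the theorem's numbering.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(weights, biases, x):
    """Forward pass that stores every z and activation for later reuse."""
    activations, zs = [list(x)], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations

def cost(weights, biases, x, y):
    _, activations = forward(weights, biases, x)
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, activations[-1]))

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) with the same shapes as weights and biases."""
    zs, activations = forward(weights, biases, x)
    # Base case: dC/da^L_j = a^L_j - y_j.
    dC_da = [a_j - y_j for a_j, y_j in zip(activations[-1], y)]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for m in reversed(range(len(weights))):
        # delta[j] = dC/da_j * sigma'(z_j) for the layer fed by weights[m].
        delta = [d * sigmoid_prime(z_j) for d, z_j in zip(dC_da, zs[m])]
        # dC/dw_jk = delta[j] * a_k, where a_k comes from the previous layer.
        grad_w[m] = [[d_j * a_k for a_k in activations[m]] for d_j in delta]
        # dC/db_j = delta[j], since dz_j/db_j = 1.
        grad_b[m] = delta[:]
        # Recursive step: dC/da for the previous layer.
        dC_da = [sum(delta[i] * weights[m][i][j] for i in range(len(delta)))
                 for j in range(len(activations[m]))]
    return grad_w, grad_b

# Hypothetical network with layer sizes 2, 3, 2.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
    [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]],
]
biases = [[0.0, 0.1, -0.1], [0.2, -0.2]]
x, y = [0.5, -1.0], [1.0, 0.0]

def numeric_weight_grad(m, j, k, eps=1e-6):
    """Central finite-difference estimate of dC/dw for one weight."""
    original = weights[m][j][k]
    weights[m][j][k] = original + eps
    c_plus = cost(weights, biases, x, y)
    weights[m][j][k] = original - eps
    c_minus = cost(weights, biases, x, y)
    weights[m][j][k] = original  # restore the weight
    return (c_plus - c_minus) / (2 * eps)

grad_w, grad_b = backprop(weights, biases, x, y)
max_error = max(
    abs(grad_w[m][j][k] - numeric_weight_grad(m, j, k))
    for m in range(len(weights))
    for j in range(len(weights[m]))
    for k in range(len(weights[m][j]))
)
print("largest difference from finite differences:", max_error)
```

The loop visits the layers from last to first, so each iteration has exactly the quantity the induction hypothesis supplies: the derivatives of $C$ with respect to the activations of the layer after it.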

References

  1. "Chapter 2: How the backpropagation algorithm works" in Neural Networks and Deep Learning. Michael A. Nielsen. Determination Press. 2015. Retrieved November 8, 2018.