This page presents a derivation (proof) of backpropagation using Leibniz notation. Leibniz notation is the most common notation for presenting backpropagation, but it is somewhat complicated because it blurs the distinction between functions and values and leaves the functional relationships between variables implicit. Those who prefer function notation may wish to refer to backpropagation derivation using function notation instead of (or in addition to) this page.
Most of the notation on this page is borrowed from Michael Nielsen's book.[1]
Theorem statement
Let $N$ be a neural network with $L$ layers, and let $n_l$ be the number of neurons in layer $l$, for $l \in \{1, \ldots, L\}$. For $l \in \{2, \ldots, L\}$, $j \in \{1, \ldots, n_l\}$, and $k \in \{1, \ldots, n_{l-1}\}$, let $w^l_{jk}$ be the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, and let $b^l_j$ be the bias of the $j$th neuron in the $l$th layer. Let

$$z^l_j = \sum_{k=1}^{n_{l-1}} w^l_{jk}\,a^{l-1}_k + b^l_j$$

and let $a^l_j = \sigma(z^l_j)$, where $\sigma$ is the sigmoid function (the activations $a^1_j$ of the first layer are the inputs to the network).
Let $C$ be a cost function, regarded as a function of the output activations $a^L_1, \ldots, a^L_{n_L}$. Then we can calculate the partial derivatives $\frac{\partial C}{\partial w^l_{jk}}$ and $\frac{\partial C}{\partial b^l_j}$ starting from the later layers and working backwards. Specifically, we have

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\sigma'(z^l_j)\,\frac{\partial C}{\partial a^l_j}$$

and

$$\frac{\partial C}{\partial b^l_j} = \sigma'(z^l_j)\,\frac{\partial C}{\partial a^l_j},$$

where $\frac{\partial C}{\partial a^L_j}$ can be computed directly from the cost function and, for $l < L$, the factor $\frac{\partial C}{\partial a^l_j}$ is obtained recursively from the next layer:

$$\frac{\partial C}{\partial a^l_j} = \sum_{m=1}^{n_{l+1}} w^{l+1}_{mj}\,\sigma'(z^{l+1}_m)\,\frac{\partial C}{\partial a^{l+1}_m}.$$
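As a sanity check on these formulas (not part of the original page), the following sketch implements the recursion for a small network in NumPy and compares the resulting $\frac{\partial C}{\partial w^l_{jk}}$ and $\frac{\partial C}{\partial b^l_j}$ against finite-difference estimates. The layer sizes and the quadratic cost $C = \tfrac{1}{2}\lVert a^L - y\rVert^2$ are illustrative assumptions; the theorem itself allows any cost function.

```python
import numpy as np

rng = np.random.default_rng(0)

sizes = [3, 4, 2]   # n_1, n_2, n_3 for a small L = 3 layer network (illustrative choice)
L = len(sizes)

# weights[l][j, k] plays the role of w^l_{jk} and biases[l][j] the role of b^l_j
# (array indices are 0-based, unlike the 1-based indices used on this page).
weights = {l: rng.normal(size=(sizes[l - 1], sizes[l - 2])) for l in range(2, L + 1)}
biases = {l: rng.normal(size=sizes[l - 1]) for l in range(2, L + 1)}
x = rng.normal(size=sizes[0])    # first-layer activations (the network input)
y = rng.normal(size=sizes[-1])   # target used by the assumed quadratic cost

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

def forward(weights, biases):
    """Forward pass: return the weighted inputs z^l and activations a^l of every layer."""
    a, z = {1: x}, {}
    for l in range(2, L + 1):
        z[l] = weights[l] @ a[l - 1] + biases[l]
        a[l] = sigma(z[l])
    return z, a

def cost(weights, biases):
    """Assumed quadratic cost C = 0.5 * ||a^L - y||^2 (the theorem allows any cost)."""
    _, a = forward(weights, biases)
    return 0.5 * np.sum((a[L] - y) ** 2)

# Backward pass: compute dC/da^l by the recursion in the theorem, then dC/dw and dC/db.
z, a = forward(weights, biases)
dC_da = {L: a[L] - y}   # direct derivative of the quadratic cost w.r.t. the output activations
for l in range(L - 1, 1, -1):
    dC_da[l] = weights[l + 1].T @ (sigma_prime(z[l + 1]) * dC_da[l + 1])

dC_dw = {l: np.outer(sigma_prime(z[l]) * dC_da[l], a[l - 1]) for l in range(2, L + 1)}
dC_db = {l: sigma_prime(z[l]) * dC_da[l] for l in range(2, L + 1)}

# Spot-check one weight and one bias against finite-difference estimates.
eps = 1e-6
l, j, k = 2, 1, 0
w_pert = {m: w.copy() for m, w in weights.items()}
w_pert[l][j, k] += eps
print(dC_dw[l][j, k], (cost(w_pert, biases) - cost(weights, biases)) / eps)

b_pert = {m: b.copy() for m, b in biases.items()}
b_pert[l][j] += eps
print(dC_db[l][j], (cost(weights, b_pert) - cost(weights, biases)) / eps)
```

Each printed pair should agree to several decimal places; the dictionaries keyed by layer number are only one convenient way to organize the quantities that the theorem refers to.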
Proof
We induct on the layer number $l$, starting at $l = L$ and working backwards toward the input layer. For the base case $l = L$, we have

$$\frac{\partial z^L_j}{\partial w^L_{jk}} = a^{L-1}_k.$$
We also have

$$\frac{\partial z^L_j}{\partial b^L_j} = 1.$$
The cost function $C$ depends on the weight $w^L_{jk}$ (and likewise on the bias $b^L_j$) only through the activation of the $j$th neuron in the $L$th layer, i.e. only through the value of $a^L_j$. Thus we can use the chain rule to expand:

$$\frac{\partial C}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j}\,\frac{\partial a^L_j}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)\,\frac{\partial z^L_j}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)\,a^{L-1}_k,$$

and in the same way

$$\frac{\partial C}{\partial b^L_j} = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)\,\frac{\partial z^L_j}{\partial b^L_j} = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j).$$
We know that $\frac{\partial a^L_j}{\partial w^L_{jk}} = \sigma'(z^L_j)\,\frac{\partial z^L_j}{\partial w^L_{jk}}$ because $a^L_j = \sigma(z^L_j)$; we have used the chain rule again here. Since $C$ is an explicit function of the output activations $a^L_1, \ldots, a^L_{n_L}$, the factor $\frac{\partial C}{\partial a^L_j}$ can be computed directly, which establishes the base case.
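As a concrete illustration of the base case (not part of the original derivation), suppose the cost is the quadratic cost $C = \tfrac{1}{2}\sum_i (a^L_i - y_i)^2$ for a fixed target vector $y$. Then $\frac{\partial C}{\partial a^L_j} = a^L_j - y_j$, and the base-case formulas specialize to

$$\frac{\partial C}{\partial w^L_{jk}} = (a^L_j - y_j)\,\sigma'(z^L_j)\,a^{L-1}_k \qquad\text{and}\qquad \frac{\partial C}{\partial b^L_j} = (a^L_j - y_j)\,\sigma'(z^L_j).$$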
Now suppose $l < L$. The same computation as in the base case, with $l$ in place of $L$, gives

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j}\,\sigma'(z^l_j)\,a^{l-1}_k \qquad\text{and}\qquad \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j}\,\sigma'(z^l_j),$$

so it remains to compute $\frac{\partial C}{\partial a^l_j}$. In turn, $C$ depends on $a^l_j$ only through the activations of the $(l+1)$th layer. Thus we can write (using the chain rule once again):

$$\frac{\partial C}{\partial a^l_j} = \sum_{m=1}^{n_{l+1}} \frac{\partial C}{\partial a^{l+1}_m}\,\frac{\partial a^{l+1}_m}{\partial a^l_j}.$$
Backpropagation works recursively, starting at the later layers. Since we are trying to compute $\frac{\partial C}{\partial a^l_j}$ for the $l$th layer, we can assume inductively that we have already computed $\frac{\partial C}{\partial a^{l+1}_m}$ for each $m \in \{1, \ldots, n_{l+1}\}$.
It remains to find $\frac{\partial a^{l+1}_m}{\partial a^l_j}$. But $a^{l+1}_m = \sigma(z^{l+1}_m)$, where $z^{l+1}_m = \sum_{k=1}^{n_l} w^{l+1}_{mk}\,a^l_k + b^{l+1}_m$, so we have

$$\frac{\partial a^{l+1}_m}{\partial a^l_j} = \sigma'(z^{l+1}_m)\,\frac{\partial z^{l+1}_m}{\partial a^l_j} = \sigma'(z^{l+1}_m)\,w^{l+1}_{mj}.$$
Putting all this together, we obtain

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\sigma'(z^l_j)\sum_{m=1}^{n_{l+1}} w^{l+1}_{mj}\,\sigma'(z^{l+1}_m)\,\frac{\partial C}{\partial a^{l+1}_m}$$

and likewise

$$\frac{\partial C}{\partial b^l_j} = \sigma'(z^l_j)\sum_{m=1}^{n_{l+1}} w^{l+1}_{mj}\,\sigma'(z^{l+1}_m)\,\frac{\partial C}{\partial a^{l+1}_m}.$$
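The recursive step can also be checked numerically in isolation. The sketch below (an illustrative setup, not from the original proof) treats the activations of a layer $l$ as free inputs to a single following layer with an assumed quadratic cost, and compares the chain-rule sum over $m$ with a finite-difference estimate of $\frac{\partial C}{\partial a^l_j}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_next = 4, 3                        # sizes of layer l and layer l+1 (illustrative)
w_next = rng.normal(size=(n_next, n_l))   # entries play the role of w^{l+1}_{mj}
b_next = rng.normal(size=n_next)          # entries play the role of b^{l+1}_m
y = rng.normal(size=n_next)               # target for the assumed quadratic cost on layer l+1

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

def C_of(a_l):
    """Cost as a function of the layer-l activations, with layer l+1 as the output layer."""
    a_next = sigma(w_next @ a_l + b_next)
    return 0.5 * np.sum((a_next - y) ** 2)

a_l = rng.normal(size=n_l)                # treat the layer-l activations as free inputs
z_next = w_next @ a_l + b_next
dC_da_next = sigma(z_next) - y            # direct derivative of the quadratic cost

# Chain-rule sum over m:  dC/da^l_j = sum_m w^{l+1}_{mj} * sigma'(z^{l+1}_m) * dC/da^{l+1}_m
j = 2
analytic = np.sum(w_next[:, j] * sigma_prime(z_next) * dC_da_next)

# Finite-difference estimate of the same derivative.
eps = 1e-6
a_plus, a_minus = a_l.copy(), a_l.copy()
a_plus[j] += eps
a_minus[j] -= eps
numeric = (C_of(a_plus) - C_of(a_minus)) / (2 * eps)
print(analytic, numeric)                  # the two values should agree closely
```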
Let us verify that we can calculate the right-hand side. By the induction hypothesis, we can already calculate $\frac{\partial C}{\partial a^{l+1}_m}$. We calculate $a^{l-1}_k$, $z^l_j$, and $z^{l+1}_m$ (and hence $\sigma'(z^l_j)$ and $\sigma'(z^{l+1}_m)$) during the forward pass through the network. Finally, $w^{l+1}_{mj}$ is just a weight in the network, so we already know its value. This completes the induction.
References

1. Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.