Backpropagation derivation using Leibniz notation

Throughout this page, let <math>n(l)</math> be the number of neurons in the <math>l</math>th layer of the neural network.

The cost function <math>C</math> depends on <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:

<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>

We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j) = \sigma\left(\sum_{k=1}^{n(l-1)} w^l_{jk}a^{l-1}_k + b^l_j\right)</math>. We have used the chain rule again here.
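
To make the intermediate step explicit: differentiating <math>a^l_j = \sigma(z^l_j)</math> through <math>z^l_j</math>, and noting that only the <math>k</math>th term of the sum defining <math>z^l_j</math> contains <math>w^l_{jk}</math>, gives

<math display="block">\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)\frac{\partial z^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math>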

In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write (using the chain rule once again):

<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>

Backpropagation works recursively, starting at the final layer of the network and working backwards. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math>.
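
The recursion needs a base case at the output layer, where the derivative can be read off from the cost function directly. For example, writing <math>L</math> for the index of the final layer and assuming (purely for illustration; the derivation on this page does not depend on the choice of cost) the quadratic cost <math>C = \frac{1}{2}\sum_{j=1}^{n(L)} (a^L_j - y_j)^2</math> with target values <math>y_j</math>, we have

<math display="block">\frac{\partial C}{\partial a^L_j} = a^L_j - y_j</math>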

It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_j w^{l+1}_{ij}a^l_j + b^{l+1}_i\right)</math>, so we have

<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>

Putting all this together, we obtain

<math display="block">\begin{align}\frac{\partial C}{\partial w^l_{jk}} &= \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \\ &= \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}\right) \sigma'(z^l_j)a^{l-1}_k \\ &= \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i)w^{l+1}_{ij}\right) \sigma'(z^l_j)a^{l-1}_k\end{align}</math>
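
As an aside, the same result is often written in a vectorized form (using matrix notation not otherwise used on this page). If we collect the weights <math>w^l_{jk}</math> into a matrix <math>W^l</math>, define <math>\delta^l_j = \frac{\partial C}{\partial a^l_j}\sigma'(z^l_j)</math>, and write <math>\odot</math> for the elementwise product, the formula above becomes

<math display="block">\delta^l = \left((W^{l+1})^\mathsf{T} \delta^{l+1}\right) \odot \sigma'(z^l), \qquad \frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k</math>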

Let us verify that we can calculate the right-hand side of the formula for <math>\frac{\partial C}{\partial w^l_{jk}}</math>. By the induction hypothesis, we can calculate <math>\frac{\partial C}{\partial a^{l+1}_i}</math>. We calculate <math>z^{l+1}_i</math>, <math>z^l_j</math>, and <math>a^{l-1}_k</math> during the forward pass through the network. Finally, <math>w^{l+1}_{ij}</math> is just a weight in the network, so we already know its value.
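
As a sanity check, the formula can also be verified numerically. The following is a minimal sketch in Python with NumPy; it assumes sigmoid activations and the quadratic cost <math>C = \frac{1}{2}\sum_j (a^L_j - y_j)^2</math>, and the layer sizes, weights, and data are made up for illustration (none of them come from the derivation itself). It computes <math>\frac{\partial C}{\partial w^l_{jk}}</math> with the formula above and compares one entry against a finite-difference estimate.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up two-layer network: sizes are n(0), n(1), n(2); layer 2 is the output layer.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = {l: rng.normal(size=(sizes[l], sizes[l - 1])) for l in (1, 2)}  # W[l][j, k] = w^l_{jk}
b = {l: rng.normal(size=sizes[l]) for l in (1, 2)}                  # b[l][j]    = b^l_j
x = rng.normal(size=sizes[0])   # input, i.e. the activations a^0_k
y = rng.normal(size=sizes[2])   # target values for the quadratic cost

def forward(W, b, x):
    """Forward pass: returns activations a[l] and weighted inputs z[l]."""
    a, z = {0: x}, {}
    for l in (1, 2):
        z[l] = W[l] @ a[l - 1] + b[l]
        a[l] = sigmoid(z[l])
    return a, z

def cost(W, b, x, y):
    a, _ = forward(W, b, x)
    return 0.5 * np.sum((a[2] - y) ** 2)

# Backward pass using the formulas derived above.
a, z = forward(W, b, x)
dC_da = {2: a[2] - y}                                   # base case at the output layer
dC_da[1] = W[2].T @ (dC_da[2] * sigmoid_prime(z[2]))    # dC/da^l_j = sum_i dC/da^{l+1}_i sigma'(z^{l+1}_i) w^{l+1}_{ij}
dC_dW = {l: np.outer(dC_da[l] * sigmoid_prime(z[l]), a[l - 1])  # dC/dw^l_{jk} = dC/da^l_j sigma'(z^l_j) a^{l-1}_k
         for l in (1, 2)}

# Compare one weight derivative against a finite-difference estimate.
l, j, k, eps = 1, 2, 0, 1e-6
W_plus = {m: W[m].copy() for m in W}
W_plus[l][j, k] += eps
numeric = (cost(W_plus, b, x, y) - cost(W, b, x, y)) / eps
print(dC_dW[l][j, k], numeric)  # the two values should agree to several decimal places
</syntaxhighlight>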