The cost function <math>C</math> depends on the weight <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:

<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>

We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)\, a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j)</math>, where <math>z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j</math> is the weighted input to the neuron and <math>\sigma</math> is the activation function. We have used the chain rule again here.
In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write (using the chain rule once again):

<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>
where <math>n(l+1)</math> is the number of neurons in the <math>(l+1)</math>th layer.
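For example, if the <math>(l+1)</math>th layer contains just two neurons, so that <math>n(l+1) = 2</math>, the sum expands to

<math display="block">\frac{\partial C}{\partial a^l_j} = \frac{\partial C}{\partial a^{l+1}_1} \frac{\partial a^{l+1}_1}{\partial a^l_j} + \frac{\partial C}{\partial a^{l+1}_2} \frac{\partial a^{l+1}_2}{\partial a^l_j}</math>

since <math>a^l_j</math> feeds into both neurons of the next layer, and each contributes to the cost.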
Backpropagation works recursively, starting at the later layers. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math> for every neuron <math>i</math> of the <math>(l+1)</math>th layer.
It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i)</math> with <math>z^{l+1}_i = \sum_j w^{l+1}_{ij} a^l_j + b^{l+1}_i</math>, so we have

<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)\, w^{l+1}_{ij}</math>
Putting all this together, we obtain

<math display="block">\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \sigma'(z^l_j) \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i}\, \sigma'(z^{l+1}_i)\, w^{l+1}_{ij}</math>
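The recursion above translates directly into a short program. The following NumPy sketch is only an illustration, not part of the derivation: it assumes a sigmoid activation <math>\sigma</math>, a quadratic cost <math>C = \tfrac{1}{2}\lVert a^L - y\rVert^2</math> (so that the base case is <math>\frac{\partial C}{\partial a^L_j} = a^L_j - y_j</math>), and the weight convention <math>w^{l+1}_{ij}</math> used in this section; the function and variable names are hypothetical. It computes <math>\frac{\partial C}{\partial w^l_{jk}}</math> for every layer by running the recursion on <math>\frac{\partial C}{\partial a^l_j}</math> from the last layer backwards, and checks one entry against a finite-difference estimate.

<syntaxhighlight lang="python">
import numpy as np

def sigma(z):
    """Sigmoid activation (an assumed choice; the derivation works for any sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def backprop_weight_grads(x, y, weights, biases):
    """Return dC/dW for every layer, for the quadratic cost C = 0.5*||a^L - y||^2.

    weights[l] is the matrix (w^{l+1}_{ij}) sending layer-l activations to
    layer (l+1); the input x is treated as the layer-0 activation.
    """
    # Forward pass: record z^l and a^l for every layer.
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)   # z^{l+1}_i = sum_j w^{l+1}_{ij} a^l_j + b^{l+1}_i
        activations.append(sigma(zs[-1]))    # a^{l+1}_i = sigma(z^{l+1}_i)

    # Backward pass: recursion on dC/da^l, starting from the last layer.
    grad_a = activations[-1] - y             # base case dC/da^L_j for the quadratic cost
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # dC/dw^{l+1}_{jk} = (dC/da^{l+1}_j) * sigma'(z^{l+1}_j) * a^l_k
        grads[l] = np.outer(grad_a * sigma_prime(zs[l]), activations[l])
        # recursion: dC/da^l_j = sum_i (dC/da^{l+1}_i) * sigma'(z^{l+1}_i) * w^{l+1}_{ij}
        grad_a = weights[l].T @ (grad_a * sigma_prime(zs[l]))
    return grads

# Quick finite-difference check on a tiny random network.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]
x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])

def cost(ws):
    a = x
    for W, b in zip(ws, biases):
        a = sigma(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

grads = backprop_weight_grads(x, y, weights, biases)
eps = 1e-6
W_pert = [W.copy() for W in weights]
W_pert[0][1, 2] += eps
numeric = (cost(W_pert) - cost(weights)) / eps
print(grads[0][1, 2], numeric)   # the two values should agree closely
</syntaxhighlight>

The finite-difference comparison at the end is a standard sanity test for a backpropagation implementation: the analytic gradient produced by the recursion should match the numerical estimate to several decimal places.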