Backpropagation derivation using Leibniz notation
The cost function <math>C</math> depends on <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:
<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>
We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j) = \sigma\left(\sum_k w^l_{jk}a^{l-1}_k + b^l_j\right)</math>. We have used the chain rule again here.
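Putting these two factors together, the weight gradient can be written as
<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j}\, \sigma'(z^l_j)\, a^{l-1}_k</math>
so the only remaining unknown is <math>\frac{\partial C}{\partial a^l_j}</math>.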
In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write:
<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>
where <math>n(l+1)</math> is the number of neurons in the <math>(l+1)</math>th layer.
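As a small concrete illustration, if the <math>(l+1)</math>th layer happens to contain just two neurons, i.e. <math>n(l+1) = 2</math>, the sum expands to
<math display="block">\frac{\partial C}{\partial a^l_j} = \frac{\partial C}{\partial a^{l+1}_1} \frac{\partial a^{l+1}_1}{\partial a^l_j} + \frac{\partial C}{\partial a^{l+1}_2} \frac{\partial a^{l+1}_2}{\partial a^l_j}.</math>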
Backpropagation works recursively starting at the later layers. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math>.
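For the recursion to start, the derivatives at the output layer must be known directly. For instance, assuming (purely for illustration) a quadratic cost <math>C = \tfrac{1}{2}\sum_j \left(a^L_j - y_j\right)^2</math> over the final layer <math>L</math> with targets <math>y_j</math>, the base case is
<math display="block">\frac{\partial C}{\partial a^L_j} = a^L_j - y_j.</math>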
It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_j w^{l+1}_{ij}a^l_j + b^{l+1}_i\right)</math> so we have
<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>