Backpropagation derivation using Leibniz notation
The cost function <math>C</math> depends on <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:
<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>
We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j) = \sigma\left(\sum_k w^l_{jk}a^{l-1}_k + b^l_j\right)</math>. We have used the chain rule again here.
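Putting these two factors together, the weight gradient can be written as
<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j}\, \sigma'(z^l_j)\, a^{l-1}_k</math>
so the only remaining unknown is <math>\frac{\partial C}{\partial a^l_j}</math>.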
In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write:
<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>
where <math>n(l+1)</math> is the number of neurons in the <math>(l+1)</math>th layer.
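As a small concrete illustration, if the <math>(l+1)</math>th layer happens to contain just two neurons, i.e. <math>n(l+1) = 2</math>, the sum expands to
<math display="block">\frac{\partial C}{\partial a^l_j} = \frac{\partial C}{\partial a^{l+1}_1} \frac{\partial a^{l+1}_1}{\partial a^l_j} + \frac{\partial C}{\partial a^{l+1}_2} \frac{\partial a^{l+1}_2}{\partial a^l_j}.</math>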
Backpropagation works recursively starting at the later layers. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math>.
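For the recursion to start, the derivatives at the output layer must be known directly. For instance, assuming (purely for illustration) a quadratic cost <math>C = \tfrac{1}{2}\sum_j \left(a^L_j - y_j\right)^2</math> over the final layer <math>L</math> with targets <math>y_j</math>, the base case is
<math display="block">\frac{\partial C}{\partial a^L_j} = a^L_j - y_j.</math>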
It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_j w^{l+1}_{ij}a^l_j + b^{l+1}_i\right)</math> so we have
<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>