Backpropagation derivation using Leibniz notation

The cost function <math>C</math> depends on <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:


<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>


We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j) = \sigma\left(\sum_{k'} w^l_{jk'}a^{l-1}_{k'} + b^l_j\right)</math>, where the summation index <math>k'</math> is written with a prime to keep it distinct from the fixed index <math>k</math>. Here we have used the chain rule again: differentiating <math>\sigma</math> gives the factor <math>\sigma'(z^l_j)</math>, and differentiating <math>z^l_j</math> with respect to <math>w^l_{jk}</math> gives <math>a^{l-1}_k</math>.
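
As a sanity check (not part of the derivation itself), this identity is easy to verify numerically. The sketch below is only an illustration under extra assumptions: it takes <math>\sigma</math> to be the logistic sigmoid (the derivation works for any differentiable <math>\sigma</math>) and compares <math>\sigma'(z^l_j)a^{l-1}_k</math> against a finite-difference estimate of <math>\frac{\partial a^l_j}{\partial w^l_{jk}}</math>; all variable names are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def sigma(z):
    """Logistic sigmoid (an assumption; the derivation allows any smooth sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
a_prev = rng.normal(size=4)        # a^{l-1}: previous-layer activations
w = rng.normal(size=(3, 4))        # w^l: weights into layer l
b = rng.normal(size=3)             # b^l: biases of layer l
j, k = 1, 2                        # the particular weight w^l_{jk} we perturb

z = w @ a_prev + b                 # z^l_j = sum_{k'} w^l_{jk'} a^{l-1}_{k'} + b^l_j
analytic = sigma_prime(z[j]) * a_prev[k]

eps = 1e-6                         # finite-difference step size
w_bumped = w.copy()
w_bumped[j, k] += eps
numeric = (sigma(w_bumped @ a_prev + b)[j] - sigma(z[j])) / eps

print(analytic, numeric)           # the two values agree to several decimal places
</syntaxhighlight>
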
In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write:


<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>


where <math>n(l+1)</math> is the number of neurons in the <math>(l+1)</math>th layer.
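
Purely as a structural illustration of this sum: if the two factors were already available as arrays (they are derived next), the sum over <math>i</math> is just a dot product. The function and argument names below are hypothetical.

<syntaxhighlight lang="python">
def dC_da_lj(dC_da_next, da_next_da_lj):
    """Chain-rule sum over the (l+1)th layer.

    dC_da_next[i]    stands for dC/da^{l+1}_i        (length n(l+1))
    da_next_da_lj[i] stands for da^{l+1}_i / da^l_j  (length n(l+1))
    """
    total = 0.0
    for i in range(len(dC_da_next)):   # i runs over the n(l+1) neurons
        total += dC_da_next[i] * da_next_da_lj[i]
    return total                       # i.e. a plain dot product of the two vectors

# Example with made-up numbers: two neurons in the (l+1)th layer.
print(dC_da_lj([1.0, 2.0], [0.5, 0.25]))   # 1.0*0.5 + 2.0*0.25 = 1.0
</syntaxhighlight>
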
Backpropagation works recursively, starting at the later layers. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math> for the <math>(l+1)</math>th layer. The base case is the output layer, where <math>\frac{\partial C}{\partial a^L_j}</math> follows directly from the definition of the cost <math>C</math>.

It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_{j'} w^{l+1}_{ij'}a^l_{j'} + b^{l+1}_i\right)</math> (again writing the summation index <math>j'</math> with a prime to keep it distinct from the fixed index <math>j</math>), so we have


<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>
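
Substituting this back into the earlier sum gives

<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i}\sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>

which, together with the first equation, completes the recursion. The sketch below assembles these pieces into a minimal backward pass. It is only an illustration under extra assumptions not made above: a quadratic cost <math>C = \tfrac{1}{2}\lVert a^L - y \rVert^2</math>, the logistic sigmoid, and an indexing in which <code>weights[l]</code> maps the activations of layer <math>l</math> to those of layer <math>l+1</math>.

<syntaxhighlight lang="python">
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return dC/dw for every layer, under the assumed quadratic cost.

    weights[l] has shape (n(l+1), n(l)) and sends layer-l activations
    to layer-(l+1) activations; acts[0] is the input x.
    """
    # Forward pass: record every z and a.
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigma(z)
        zs.append(z)
        acts.append(a)

    # dC/da for the output layer under C = 1/2 * ||a^L - y||^2.
    dC_da = acts[-1] - y

    grads_w = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # dC/dw_{jk} = dC/da_j * sigma'(z_j) * a_k  (outer product over j, k)
        grads_w[l] = (dC_da * sigma_prime(zs[l]))[:, None] * acts[l][None, :]
        # Recursion: dC/da_j for the layer below is
        # sum_i dC/da_i * sigma'(z_i) * w_{ij}.
        dC_da = weights[l].T @ (dC_da * sigma_prime(zs[l]))
    return grads_w

# Tiny usage example with random data (shapes only; the values are arbitrary).
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
grads = backprop(weights, biases, rng.normal(size=sizes[0]), rng.normal(size=sizes[-1]))
print([g.shape for g in grads])   # [(5, 4), (3, 5)]
</syntaxhighlight>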
