Backpropagation derivation using Leibniz notation
This page presents a '''derivation/proof of backpropagation using Leibniz notation'''. Leibniz notation is the most common notation for presenting backpropagation, but it somewhat complicates the presentation because it blurs the function/value distinction and leaves the functional relationships implicit. Those who prefer function notation may wish to refer to [[backpropagation derivation using function notation]] instead of (or in addition to) this page.
Most of the notation on this page is borrowed from Michael Nielsen's book.<ref>[http://neuralnetworksanddeeplearning.com/chap2.html "Chapter 2: How the backpropagation algorithm works"] in ''Neural Networks and Deep Learning''. Michael A. Nielsen. ''Determination Press''. 2015. Retrieved November 8, 2018.</ref>
==Theorem statement==
Let <math>N</math> be a neural network with <math>L</math> layers and let <math>n(l)</math> be the number of neurons in layer <math>l</math> for <math>l \in \{1, \ldots, L\}</math>. For <math>l \in \{2, \ldots, L\}</math>, <math>k\in\{1, \ldots, n(l-1)\}</math>, and <math>j \in \{1, \ldots, n(l)\}</math>, let <math>w^l_{jk} \in \mathbf R</math> be the weight from the <math>k</math>th neuron in the <math>(l-1)</math>th layer to the <math>j</math>th neuron in the <math>l</math>th layer, and let <math>b^l_j \in \mathbf R</math> be the bias of the <math>j</math>th neuron in the <math>l</math>th layer. Let <math>z^l_j = \sum_{k=1}^{n(l-1)} w^l_{jk}a^{l-1}_k + b^l_j</math> and let <math>a^l_j = \sigma(z^l_j)</math>, where <math>\sigma : \mathbf R \to \mathbf R</math> is the [[sigmoid function]] and <math>a^1_k</math> denotes the <math>k</math>th input activation. Let <math>C = \frac12 \sum_{j=1}^{n(L)} (y_j - a^L_j)^2</math> be the cost function, where <math>y_j</math> is the desired output of the <math>j</math>th neuron in the output layer. Then we can calculate the partial derivatives <math>\frac{\partial C}{\partial w^l_{jk}}</math> and <math>\frac{\partial C}{\partial b^l_j}</math> starting from the later layers and working backward. Specifically, for <math>l \in \{2, \ldots, L-1\}</math> we have
<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i)w^{l+1}_{ij}\right) \sigma'(z^l_j)a^{l-1}_k</math>
and
<math display="block">\frac{\partial C}{\partial b^l_j} = ???</math>
==Proof==
We induct downward on the layer number <math>l</math>, starting at <math>l=L</math> and working toward the earlier layers. For the base case <math>l = L</math>, we have
<math display="block">\frac{\partial C}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial w^L_{jk}} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)a^{L-1}_k </math>
We also have
<math display="block">\frac{\partial C}{\partial a^L_j} = \frac{\partial}{\partial a^L_j}\left(\frac12 \sum_{i=1}^{n(L)} (y_i - a^L_i)^2\right) = a^L_j - y_j</math>
since only the <math>i=j</math> term of the sum depends on <math>a^L_j</math>. Combining these two equations gives <math>\frac{\partial C}{\partial w^L_{jk}} = (a^L_j - y_j)\sigma'(z^L_j)a^{L-1}_k</math>, which establishes the base case.
Now consider the inductive step, so fix <math>l < L</math>. The cost function <math>C</math> depends on <math>w^l_{jk}</math> only through the activation of the <math>j</math>th neuron in the <math>l</math>th layer, i.e. on the value of <math>a^l_j</math>. Thus we can use the chain rule to expand:

<math display="block">\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}</math>


We know that <math>\frac{\partial a^l_j}{\partial w^l_{jk}} = \sigma'(z^l_j)a^{l-1}_k</math> because <math>a^l_j = \sigma(z^l_j) = \sigma\left(\sum_{k=1}^{n(l-1)} w^l_{jk}a^{l-1}_k + b^l_j\right)</math>. We have used the chain rule again here.
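This partial derivative can be checked numerically as well. The sketch below (again an added illustration; the numbers are arbitrary) perturbs a single weight feeding one neuron and compares the finite-difference quotient with <math>\sigma'(z^l_j)a^{l-1}_k</math>:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Check da^l_j/dw^l_{jk} = sigma'(z^l_j) * a^{l-1}_k for one neuron j.
a_prev = np.array([0.3, -0.7, 0.5])   # a^{l-1}
w_row  = np.array([0.1,  0.4, -0.2])  # the weights (w^l_{jk})_k for the fixed j
b_j    = 0.05                         # b^l_j
k      = 2

z_j = w_row @ a_prev + b_j
analytic = sigmoid_prime(z_j) * a_prev[k]

eps = 1e-6
w_bumped = w_row.copy()
w_bumped[k] += eps
numeric = (sigmoid(w_bumped @ a_prev + b_j) - sigmoid(z_j)) / eps
print(analytic, numeric)   # close agreement (forward-difference error is O(eps))
</syntaxhighlight>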


In turn, <math>C</math> depends on <math>a^l_j</math> only through the activations of the <math>(l+1)</math>th layer. Thus we can write (using the chain rule once again):


<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i=1}^n(l+1) \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>
<math display="block">\frac{\partial C}{\partial a^l_j} = \sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}</math>
 
where <math>n(l+1)</math> is the number of neurons in the <math>(l+1)</math>th layer.


Backpropagation works recursively starting at the later layers. Since we are trying to compute <math>\frac{\partial C}{\partial a^l_j}</math> for the <math>l</math>th layer, we can assume inductively that we have already computed <math>\frac{\partial C}{\partial a^{l+1}_i}</math>.


It remains to find <math>\frac{\partial a^{l+1}_i}{\partial a^l_j}</math>. But <math>a^{l+1}_i = \sigma(z^{l+1}_i) = \sigma\left(\sum_{j=1}^{n(l)} w^{l+1}_{ij}a^l_j + b^{l+1}_i\right)</math> so we have


<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>
<math display="block">\frac{\partial a^{l+1}_i}{\partial a^l_j} = \sigma'(z^{l+1}_i)w^{l+1}_{ij}</math>
Line 19: Line 41:
Putting all this together, we obtain


<math display="block">\begin{align}\frac{\partial C}{\partial w^l_{jk}} &= \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \\ &= \left(\sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}\right) \sigma'(z^l_j)a^{l-1}_k \\ &= \left(\sum_{i \in \{1,\ldots,n(l+1)\}} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i)w^{l+1}_{ij}\right) \sigma'(z^l_j)a^{l-1}_k\end{align}</math>
<math display="block">\begin{align}\frac{\partial C}{\partial w^l_{jk}} &= \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \\ &= \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \frac{\partial a^{l+1}_i}{\partial a^l_j}\right) \sigma'(z^l_j)a^{l-1}_k \\ &= \left(\sum_{i=1}^{n(l+1)} \frac{\partial C}{\partial a^{l+1}_i} \sigma'(z^{l+1}_i)w^{l+1}_{ij}\right) \sigma'(z^l_j)a^{l-1}_k\end{align}</math>
 
Let us verify that we can calculate the right-hand side. By the induction hypothesis, we can calculate <math>\frac{\partial C}{\partial a^{l+1}_i}</math>. We calculate <math>z^{l+1}_i</math>, <math>z^l_j</math>, and <math>a^{l-1}_k</math> during the forward pass through the network. Finally, <math>w^{l+1}_{ij}</math> is just a weight in the network, so we already know its value. The corresponding formulas for <math>\frac{\partial C}{\partial b^l_j}</math> (including the base case <math>l = L</math>) follow from exactly the same argument: since <math>\frac{\partial z^l_j}{\partial b^l_j} = 1</math>, every occurrence of the factor <math>a^{l-1}_k</math> above is simply replaced by <math>1</math>.
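To make the claim that the right-hand side is computable concrete, here is a finite-difference spot check of the assembled formula. This is an added illustration that assumes the <code>sigmoid</code>, <code>forward</code>, and <code>backward</code> functions from the sketch after the theorem statement are in scope; the layer sizes and the indices <code>l</code>, <code>j</code>, <code>k</code> are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

# Spot check of dC/dw^l_{jk} computed by the recursion against a numerical
# gradient of C. Assumes sigmoid, forward and backward from the earlier sketch.
rng = np.random.default_rng(0)
n = [None, 3, 4, 2]   # n[l] = n(l); a 3-layer network with layer sizes 3, 4, 2
L = len(n) - 1
weights = [None, None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(2, L + 1)]
biases  = [None, None] + [rng.standard_normal(n[l]) for l in range(2, L + 1)]
a1 = rng.standard_normal(n[1])
y  = rng.standard_normal(n[L])

def cost(ws):
    _, acts = forward(ws, biases, a1)
    return 0.5 * np.sum((y - acts[L]) ** 2)

zs, activations = forward(weights, biases, a1)
dC_dw, dC_db = backward(weights, zs, activations, y)

l, j, k, eps = 2, 1, 0, 1e-6
bumped = [None if W is None else W.copy() for W in weights]
bumped[l][j, k] += eps
numeric = (cost(bumped) - cost(weights)) / eps
print(dC_dw[l][j, k], numeric)   # close agreement (forward-difference error is O(eps))
</syntaxhighlight>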
 
==References==
 
<references/>
