Equations for Backpropagation


I am studying neural networks.

From http://neuralnetworksanddeeplearning.com/chap2.html, it says

$\delta^L_j=\frac{\partial C}{\partial z_j^L}=\sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L}$

Similarly,

$\delta^L_j=\frac{\partial C}{\partial z_j^L}=\sum_k \frac{\partial C}{\partial z_k^{L+1}} \frac{\partial z_k^{L+1}}{\partial z_j^L}$

Here $l$ indexes a layer, $\delta^l$ is the error vector, $a^l$ is the activation, and $z^l$ is the weighted input to layer $l$ (computed from the previous layer's activations).

I do not understand why a single partial derivative equals a sum of chained partial derivatives.

What I think is:

$\sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L} = \frac{\partial C}{\partial a_1^L} \frac{\partial a_1^L}{\partial z_j^L} + \frac{\partial C}{\partial a_2^L} \frac{\partial a_2^L}{\partial z_j^L} + ... + \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L} =k\frac{\partial C}{\partial z_j^L} $

How can it be:

$\delta^L_j=\sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L}=\frac{\partial C}{\partial z_j^L}$?

Thank you!


Best answer

First recall the chain rule for the composition of a function $f:\mathbb{R}^{m}\longrightarrow\mathbb{R}$ with $m$ functions $g_1:\mathbb{R}^{n}\longrightarrow\mathbb{R},\ldots,g_m:\mathbb{R}^{n}\longrightarrow\mathbb{R}$. Define $h:\mathbb{R}^{n}\longrightarrow\mathbb{R}$ by $$h:=f(g_1,\ldots,g_m),$$ or, more precisely, $$h(x_1,\ldots,x_n)=f\big(g_1(x_1,\ldots,x_n),\ldots,g_m(x_1,\ldots,x_n)\big).$$ Then for each $j$ with $1\leq j\leq n$ you have: $$\frac{\partial h}{\partial x_j}(x_1,\ldots,x_n) =\frac{\partial f}{\partial g_1}\frac{\partial g_1}{\partial x_j}+\cdots+\frac{\partial f}{\partial g_m}\frac{\partial g_m}{\partial x_j}$$ $$=\nabla{f}\big(g_{1}(x_1,\ldots,x_n),\ldots,g_{m}(x_1,\ldots,x_n)\big)\,\bullet\left(\frac{\partial g_1}{\partial x_j},\ldots,\frac{\partial g_m}{\partial x_j}\right)$$
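As a sanity check, this chain-rule identity can be verified numerically on a small hypothetical example (the functions $f$, $g_1$, $g_2$ below are made up for illustration): the analytic sum of chained partials should match a finite-difference estimate of $\partial h/\partial x_1$.

```python
import numpy as np

# Hypothetical example: h(x) = f(g1(x), g2(x)) with
#   g1(x1, x2) = x1 * x2,  g2(x1, x2) = x1 + x2,  f(u, v) = u^2 + sin(v)
def g1(x): return x[0] * x[1]
def g2(x): return x[0] + x[1]
def f(u, v): return u**2 + np.sin(v)
def h(x): return f(g1(x), g2(x))

x = np.array([0.7, -1.3])
u, v = g1(x), g2(x)

# Chain rule: dh/dx1 = df/dg1 * dg1/dx1 + df/dg2 * dg2/dx1
df_dg1, df_dg2 = 2 * u, np.cos(v)   # partials of f at (g1(x), g2(x))
dg1_dx1, dg2_dx1 = x[1], 1.0        # partials of g1, g2 w.r.t. x1
analytic = df_dg1 * dg1_dx1 + df_dg2 * dg2_dx1

# Central finite-difference estimate of dh/dx1 for comparison
eps = 1e-6
numeric = (h(x + np.array([eps, 0.0])) - h(x - np.array([eps, 0.0]))) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```

Note that the two individual terms in `analytic` are different numbers; only their *sum* equals the derivative.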

The equation you ask about is a consequence of this chain rule for several variables. Your cost function $C$ depends on the output activations $a_1,a_2,\ldots,a_m$, and each activation $a_k$ depends on the variables $z_1,\ldots,z_n$: $$C=C(a_1,\ldots,a_m), \qquad a_{k}=a_{k}(z_1,\ldots,z_n),$$ where $n$ is the number of inputs of the layer and $m$ is the number of outputs of the same layer. So the cost function depends on $z_1,\ldots,z_n$ through the composition with the activations $a_1,\ldots,a_m$:

$$C=C(z_1,\ldots,z_n) =C(a_{1}(z_1,\ldots,z_n),\ldots,a_{m}(z_1,\ldots,z_n)).$$ By the chain rule, this implies $$\delta_j:=\frac{\partial C}{\partial z_j}=\sum_{k=1}^{m}\frac{\partial C}{\partial a_k}\frac{\partial a_k}{\partial z_j}.$$ This holds for each layer $L$, so you can write: $$\delta_{j}^{L}:=\frac{\partial C}{\partial z_{j}^{L}}=\sum_{k=1}^{m}\frac{\partial C}{\partial a_{k}^{L}}\frac{\partial a_{k}^{L}}{\partial z_{j}^{L}}$$ Note that the individual terms $\frac{\partial C}{\partial a_{k}^{L}}\frac{\partial a_{k}^{L}}{\partial z_{j}^{L}}$ are in general all different, and none of them equals $\frac{\partial C}{\partial z_{j}^{L}}$ on its own; only their sum does. That is where your expansion goes wrong: it treats every term as equal to $\frac{\partial C}{\partial z_j^L}$.