I'm following along with chapter 2 of this book and trying to prove the four backpropagation equations for neural networks:
Prove: \begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber\end{eqnarray}
However, I'm stuck on the very first step, in which $\delta^l_j$ (the error $\partial C/\partial z^l_j$ of neuron $j$ in layer $l$) is expressed in terms of the errors $\delta^{l+1}_k$ in the next layer:
\begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray}
The chain rule is supposed to be at play here, and I certainly can see its form at work. However, the discrete nature of the problem has me a bit turned around. Can someone simply explain the first step to me?
As for the work I have done myself, it is confined to a single self-contained layer, which is the first proof of the chapter. I was able to do it without reading ahead, which made me happy:
By definition, $C$ is a function of the activation $a^l_j = \sigma(z^l_j)$, where $z^l_j$ is the weighted input $z^l_j=\sum_k w^l_{jk}a^{l-1}_k + b^l_j$. So, by the chain rule, it's pretty trivial for me to say:
\begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \\ & = & \frac{\partial C}{\partial a^l_j}\frac{\partial a^l_j}{\partial z^l_j}\\ & = & \frac{\partial C}{\partial a^l_j}\sigma'(z^l_j)\end{eqnarray}
Your explanations, hints, pokes and prods are appreciated. This is not homework, so feel free to use as much or as little detail as you desire.
Edit in response to comment: @jeea, I've written $z^{l+1}_j = \sum_k w^{l+1}_{jk} \sigma(z^l_k)+ b^{l+1}_j$, which is the only way I can see to write one in terms of the previous. How this relates to the proof, I still cannot see.
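Second edit: to convince myself that the identity at least holds before I manage to prove it, I checked it numerically on a tiny random network with a quadratic cost, comparing the $\delta^l$ given by the equation against a finite-difference estimate of $\partial C/\partial z^l_j$. This is just a sanity check, not a proof, and all the names (`sigmoid`, `W2`, etc.) are mine rather than the book's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

n_l, n_next = 3, 2                    # sizes of layers l and l+1 (l+1 is the output layer here)
W2 = rng.normal(size=(n_next, n_l))   # w^{l+1}
b2 = rng.normal(size=n_next)          # b^{l+1}
zl = rng.normal(size=n_l)             # z^l, treated as the independent variable
y  = rng.normal(size=n_next)          # target

def cost_from_zl(zl):
    # forward pass from z^l: a^l = sigma(z^l), z^{l+1} = w^{l+1} a^l + b^{l+1},
    # quadratic cost C = (1/2) ||a^{l+1} - y||^2
    a_l = sigmoid(zl)
    z2 = W2 @ a_l + b2
    a2 = sigmoid(z2)
    return 0.5 * np.sum((a2 - y) ** 2)

# delta^{l+1} for the output layer with quadratic cost: (a^{l+1} - y) * sigma'(z^{l+1})
z2 = W2 @ sigmoid(zl) + b2
delta2 = (sigmoid(z2) - y) * sigmoid_prime(z2)

# the equation I'm trying to prove: delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l)
delta1 = (W2.T @ delta2) * sigmoid_prime(zl)

# central finite-difference estimate of dC/dz^l_j, component by component
eps = 1e-6
fd = np.array([(cost_from_zl(zl + eps * np.eye(n_l)[j]) -
                cost_from_zl(zl - eps * np.eye(n_l)[j])) / (2 * eps)
               for j in range(n_l)])

print(np.max(np.abs(delta1 - fd)))  # tiny if the identity holds
```

The two vectors agree to within finite-difference error, so the equation is certainly true; I just can't yet see the justification for the sum over $k$ in step (41).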