Why does a $\sum_k$ appear when using the chain rule to derive $\delta^L_j?$


I'm following along with this book on machine learning.

At the moment, the author is proving that

\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}

  • $\delta^L_j$ is the output error of the $j^{\rm th}$ sigmoid neuron in layer $L$ ($L$ is the last layer in the neural network)
  • $C$ is the cost for a single training example
  • $a^L_j$ is the output activation of the $j^{\rm th}$ neuron in layer $L$
  • $z^L_j$ is the weighted input to the $j^{\rm th}$ neuron in layer $L$
  • $\sigma'$ is the derivative of the sigmoid function

I'm concerned with step $(2)$ in the author's proof:

\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{1}\end{eqnarray}

\begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{2}\end{eqnarray}

\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{3}\end{eqnarray}

\begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{4}\end{eqnarray}

The author's explanation for the simplification from step 2 to 3 is:

the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k=j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$.

Where does the $\sum_k$ come from in $(2)$? Why can't we just skip straight from $(1)$ to $(3)$ and ignore the $\sum_k$ used in $(2)$?

Best Answer

The author is using the multivariable chain rule. We want the partial of $C$ with respect to the $z$'s, but we only know the partials of $C$ with respect to the $a$'s. Viewed as a function of the output activations, $C = C(a^L_1, a^L_2, \dots)$, so a change in $z^L_j$ could in principle affect $C$ through *every* activation $a^L_k$, and the chain rule must account for all of those paths:

\begin{eqnarray} \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}. \nonumber\end{eqnarray}

That is exactly step $(2)$; you cannot skip from $(1)$ to $(3)$ without justification. The justification is an important claim about the model: since $a^L_k = \sigma(z^L_k)$, each output activation depends only on its own weighted input, so $\partial a^L_k / \partial z^L_j = 0$ whenever $k \neq j$. Every term in the sum then vanishes except the $k = j$ term, which gives $(3)$.
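The collapse of the sum can be checked numerically. The sketch below (my own illustration, not from the book) uses a hypothetical 3-neuron output layer with the quadratic cost $C = \frac{1}{2}\sum_k (a^L_k - y_k)^2$: it evaluates the full sum from $(2)$, the single surviving term from $(3)$, and a finite-difference estimate of $\partial C / \partial z^L_j$, and all three agree.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical setup: a 3-neuron output layer, quadratic cost.
z = np.array([0.5, -1.0, 2.0])   # weighted inputs z^L
y = np.array([1.0, 0.0, 1.0])    # target outputs
a = sigmoid(z)                   # activations a^L = sigmoid(z^L)

j = 0  # the neuron we differentiate with respect to

# dC/da_k for the quadratic cost C = 0.5 * sum((a_k - y_k)^2)
dC_da = a - y

# Full chain-rule sum over k, as in step (2).
# Since a_k = sigmoid(z_k), da_k/dz_j is sigmoid'(z_j) when k == j, else 0.
full_sum = sum(
    dC_da[k] * (sigmoid_prime(z[j]) if k == j else 0.0)
    for k in range(len(z))
)

# Collapsed form, as in step (3): only the k = j term survives.
single_term = dC_da[j] * sigmoid_prime(z[j])

# Independent finite-difference estimate of dC/dz_j.
C = lambda zz: 0.5 * np.sum((sigmoid(zz) - y) ** 2)
eps = 1e-6
z_plus, z_minus = z.copy(), z.copy()
z_plus[j] += eps
z_minus[j] -= eps
numeric = (C(z_plus) - C(z_minus)) / (2 * eps)

print(full_sum, single_term, numeric)  # all three agree
```

The sum and the single term are identical by construction (the off-diagonal terms are exactly zero), and the finite-difference check confirms that this really is $\partial C / \partial z^L_j$.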