How does using the chain rule in the backpropagation algorithm work?


Let's take a simple error function $E(z) = \frac{1}{2} (a - y(z))^2$.

How come $\frac{\partial E}{\partial z} = \frac{\mathrm dy}{\mathrm dz} \frac{\partial E}{\partial y}$ ?

In other words, doesn't this mean that $\frac{\partial E}{\partial z} = \frac{\partial E}{\partial y(z)}$? If so, why?

After all, by the chain rule, $\frac{\partial E}{\partial y(z)} = \frac{\mathrm dy}{\mathrm dz} \frac{\partial E}{\partial y}$. But that does not seem to make sense, since it would mean that $\frac{\mathrm dy}{\mathrm dz} = 1$.

These formulas come from the backpropagation algorithm for a neural network.

Here $y$ is the sigmoid function.

Best answer:

The chain rule is a standard result from calculus; you can take it as given. For further reading on the chain rule, the Wikipedia article is a good place to start.

Coming back to your question, you say that: $$\frac{\partial y}{\partial z} = 1 .$$

I guess you mean "for all $z$". This is only true if $y(z) = z + b$ for some constant $b$. Hence, only when $y(z) = z + b$ can you conclude, by applying the chain rule, that:

$$\frac{\partial E}{\partial z} = \frac{\partial E}{\partial y}.$$

This is not true in general. For example, suppose that:

$$y = \cos(z).$$

Then:

$$E(z) = \frac{1}{2} (a - y(z))^2 = \frac{1}{2} (a - \cos(z))^2.$$

The derivative with respect to $z$ is:

$$\frac{\partial E(z)}{\partial z} = 2\cdot\frac{1}{2}\,(a-\cos(z))\cdot\sin(z) = \sin(z)\,(a-\cos(z)).$$

Let's try to work with the chain rule. We know that:

$$\begin{cases} \displaystyle\frac{\partial y}{\partial z} = -\sin(z)\\ \displaystyle\frac{\partial E}{\partial y} = 2\cdot\frac{1}{2}\,(a-y(z))\cdot(-1) = \cos(z) - a \end{cases} $$

Finally:

$$\frac{\partial E}{\partial z} = \frac{\partial y}{\partial z}\, \frac{\partial E}{\partial y} = (-\sin(z))\,(\cos(z)-a) = \sin(z)\,(a-\cos(z)).$$

So you arrive at the same result either way.
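As a quick numerical sanity check (my addition, not part of the original answer), here is a small Python sketch for the $y(z) = \cos(z)$ example: it compares the chain-rule expression $\sin(z)(a-\cos(z))$ with a finite-difference approximation of $\partial E/\partial z$. The values of $a$ and $z$ are arbitrary.

```python
import math

a = 0.7   # arbitrary target value, chosen only for illustration
z = 1.3   # arbitrary input value

def E(z):
    """Error for the example y(z) = cos(z): E(z) = 1/2 * (a - cos(z))**2."""
    return 0.5 * (a - math.cos(z)) ** 2

# Chain rule: dE/dz = (dE/dy) * (dy/dz) = (cos(z) - a) * (-sin(z)) = sin(z) * (a - cos(z))
chain_rule = math.sin(z) * (a - math.cos(z))

# Central finite-difference approximation of dE/dz
h = 1e-6
finite_diff = (E(z + h) - E(z - h)) / (2 * h)

print(chain_rule, finite_diff)  # the two values agree to about 9 decimal places
```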


In the context of neural networks, you typically take $y(z)$ to be the sigmoid function, which describes the activation of a neuron. This function has a very nice property. Indeed:

$$\frac{\partial y(z)}{\partial z} = y(z)(1-y(z)).$$
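For completeness (this step is not spelled out in the original answer), the identity follows directly from the definition of the sigmoid, $y(z) = \frac{1}{1+e^{-z}}$:

$$\frac{\partial y}{\partial z} = \frac{e^{-z}}{(1+e^{-z})^{2}} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = y(z)\,\bigl(1-y(z)\bigr).$$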

Therefore, you don't need to evaluate $\frac{\partial y}{\partial z}$ separately: you get it for free once you know $y(z)$. In this case, you get that:

$$\frac{\partial E(z)}{\partial z} = y(z)\,(1-y(z))\,\bigl(y(z)-a\bigr) = -\,y(z)\,(1-y(z))\,\bigl(a-y(z)\bigr).$$

In particular, you already know the value of $y(z)$, since it is the activation of the neuron, and $a - y(z)$ is the error that you want to backpropagate. Using the sigmoid, you avoid computing the derivative of $y(z)$ explicitly, which keeps the algorithm fast; if you had to evaluate the derivative explicitly each time, the algorithm would be slower.
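To make the shortcut concrete, here is a minimal Python sketch (again my own illustration, not code from the answer): it computes $\partial E/\partial z$ for a single sigmoid neuron using only the activation $y(z)$ and the error term, and checks the result against a finite-difference approximation. The target $a$ and pre-activation $z$ are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def E(z, a):
    """E(z) = 1/2 * (a - y(z))**2 with y the sigmoid."""
    return 0.5 * (a - sigmoid(z)) ** 2

a = 0.9   # arbitrary target value, for illustration only
z = 0.4   # arbitrary pre-activation

y = sigmoid(z)  # activation of the neuron, already available from the forward pass

# Sigmoid shortcut: dE/dz = y*(1 - y) * dE/dy = y*(1 - y) * (y - a)
grad_shortcut = y * (1.0 - y) * (y - a)

# Central finite-difference check
h = 1e-6
grad_numeric = (E(z + h, a) - E(z - h, a)) / (2 * h)

print(grad_shortcut, grad_numeric)  # both print (approximately) the same value
```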