Backpropagation Hidden Layer Error


I'm trying to understand the maths behind backpropagation using this book. I have looked at the formulae the backprop algorithm uses and have worked through their proofs; however, I was wondering whether someone could give me a high-level explanation of the following statement about the formula used to calculate the error of an arbitrary hidden layer $l$:

$\delta^{l} = ((w^{l+1})^{T} \delta^{l + 1}) \odot \sigma'(z^{l})$

where $w^{l+1}$ is the weight matrix of the subsequent layer, $z^{l}$ are the weighted inputs at layer $l$, and $\sigma$ is the activation function.
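For concreteness, here is a minimal numerical sketch of that formula (the layer sizes, sigmoid activation, and random values are my own illustration, not from the book):

```python
import numpy as np

# Hypothetical sizes: layer l has 3 neurons, layer l+1 has 2.
rng = np.random.default_rng(0)
W_next = rng.normal(size=(2, 3))      # w^{l+1}: maps layer l -> layer l+1
delta_next = rng.normal(size=(2, 1))  # delta^{l+1}: error at layer l+1
z_l = rng.normal(size=(3, 1))         # z^l: weighted inputs at layer l

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (3, 1): one error component per neuron in layer l
```

Note that the transpose is what makes the shapes work out: $w^{l+1}$ maps layer $l$ forward to layer $l+1$, so $(w^{l+1})^{T}$ maps a vector living at layer $l+1$ back to one living at layer $l$.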

The book states that:

Suppose we know the error $\delta^{l+1}$ at the $l+1^{th}$ layer. When we apply the transpose weight matrix, $(w^{l+1})^{T}$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{th}$ layer. We then take the Hadamard product $\odot \sigma'(z^{l})$. This moves the error backward through the activation function in layer $l$, giving us the error $\delta^{l}$ in the weighted input to layer $l$.

I don't quite grasp how multiplying $\delta^{l+1}$ by $(w^{l+1})^{T}$ is supposed to "move the error backwards". I also can't see how taking the Hadamard product with the derivative of $\sigma$ "moves the error backwards through the activation function"; intuitively, I would have expected the inverse $\sigma^{-1}$ to be more suitable for that.

Any input will be greatly appreciated! Thanks a lot!

BEST ANSWER

The feed-forward relation between consecutive layers is

$$z^{l+1} = w^{l+1} \sigma(z^{l}) + b^{l+1},$$

since the output (activation) of layer $l$ is $a^{l} = \sigma(z^{l})$. The error is defined as $\delta^{l} = \partial C / \partial z^{l}$, so applying the chain rule componentwise,

$$\delta^{l}_{k} = \frac{\partial C}{\partial z^{l}_{k}} = \sum_{j} \frac{\partial C}{\partial z^{l+1}_{j}} \frac{\partial z^{l+1}_{j}}{\partial z^{l}_{k}} = \sum_{j} \delta^{l+1}_{j} \, w^{l+1}_{jk} \, \sigma'(z^{l}_{k}),$$

which in matrix form is exactly

$$\delta^{l} = ((w^{l+1})^{T} \delta^{l+1}) \odot \sigma'(z^{l}).$$

The transpose appears because the sum runs over the *first* index of $w^{l+1}$: each neuron $k$ in layer $l$ collects the errors of all the neurons $j$ in layer $l+1$ that it feeds, weighted by the connection strengths. That is the precise sense in which multiplying by $(w^{l+1})^{T}$ "moves the error backward". And the derivative $\sigma'(z^{l})$, not the inverse, shows up because the chain rule differentiates *through* the activation function rather than undoing it.

So it's just a straightforward application of the chain rule. I agree the notation is somewhat confusing though. I had to read through it a few times to figure it out.
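As a sanity check on the chain-rule derivation, here is a small numerical sketch (the network sizes, random data, and quadratic cost are my own assumptions) comparing the formula's $\delta^{l}$ against a centred finite-difference estimate of $\partial C / \partial z^{l}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny network: 4 inputs -> layer l (3 neurons) -> layer l+1 (2 outputs).
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
x = rng.normal(size=(4, 1))
y = rng.normal(size=(2, 1))

def cost_from_z1(z1):
    """Quadratic cost C = 0.5 * ||a^{l+1} - y||^2 as a function of z^l."""
    a1 = sigmoid(z1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# Forward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Output-layer error, then propagate back with the formula in question.
delta2 = (a2 - y) * sigmoid_prime(z2)       # delta^{l+1} = dC/dz^{l+1}
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)  # delta^l via the backprop formula

# Centred finite differences: perturb each component of z^l directly.
eps = 1e-6
numeric = np.zeros_like(z1)
for i in range(z1.size):
    zp, zm = z1.copy(), z1.copy()
    zp[i, 0] += eps
    zm[i, 0] -= eps
    numeric[i, 0] = (cost_from_z1(zp) - cost_from_z1(zm)) / (2 * eps)

print(np.allclose(delta1, numeric, atol=1e-6))  # True
```

The two vectors agree to within the finite-difference error, which is the chain rule doing exactly what the book's formula claims.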