I saw it elsewhere as well, but it was never explained how they arrive at it, especially the part where the error of the output layer is dot-producted with the theta connecting the output and hidden layers. I know how to use the chain rule to find the gradients for the weights and bias at any layer, but this twist has stumped me: I am unable to extrapolate my knowledge of calculus to this new situation.
The formula is below and appears in week 5 of the ML course on Coursera.
The formula, assuming layer 4 is the output layer, is

$$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \odot a^{(3)} \odot (1 - a^{(3)})$$

where $\Theta^{(3)}$ is the weight matrix between layers 3 and 4, $a^{(3)}$ is the output of layer 3 with a sigmoid activation, $\odot$ is element-wise multiplication, and $(\Theta^{(3)})^T \delta^{(4)}$ is an ordinary matrix product.
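As a concrete shape check of how I read the formula (the layer sizes here are hypothetical, just to pin down what multiplies what):

```python
import numpy as np

# Hypothetical sizes: layer 3 has 5 units, layer 4 (output) has 3 units.
rng = np.random.default_rng(0)
theta3 = rng.standard_normal((3, 5))                  # weights mapping layer 3 -> layer 4
delta4 = rng.standard_normal((3, 1))                  # error at the output layer
a3 = 1 / (1 + np.exp(-rng.standard_normal((5, 1))))   # sigmoid activations of layer 3

# The formula as I understand it: delta3 = theta3^T . delta4 .* a3 .* (1 - a3)
delta3 = (theta3.T @ delta4) * a3 * (1 - a3)
print(delta3.shape)  # (5, 1), one error value per unit in layer 3
```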
I think Andrew uses sigmoid on every layer for this example.
Edit: I found the same question on Quora and again an answer without a thorough explanation.
For example, why can we just multiply the error term in layer $L$ by theta to back-propagate to the error term in layer $L-1$? In the chain rule we take partial derivatives along the path to the parameter we want the gradient for; with this equation I have no idea how what they did attributes error to particular nodes. Also, how can it be backprop if he skips the sigmoid on the final output layer and multiplies the final error directly by the theta connecting it to the hidden layer? That is not the chain rule, and I cannot find an explanation anywhere.
Honestly, I feel that this course is geared toward people who want to learn ML without the math. I wish there were more math, or at least links to resources that explain things like this.
I found an answer here: hidden layer error for nn
I wanted to delete this, but someone has upvoted it, so perhaps the link and brief explanation will help someone.
Basically, the error at a hidden layer is defined as the partial derivative of the cost with respect to the input $z$ at that layer. When you apply the chain rule and do some substitutions, you get exactly what Andrew got.
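Spelled out, with $\Theta^{(l)}$ taken as the weights from layer $l$ to layer $l+1$, so that $z^{(l+1)} = \Theta^{(l)} a^{(l)}$ and $a^{(l)} = \sigma(z^{(l)})$:

$$\delta^{(l)} \equiv \frac{\partial J}{\partial z^{(l)}} = (\Theta^{(l)})^T \frac{\partial J}{\partial z^{(l+1)}} \odot \sigma'(z^{(l)}) = (\Theta^{(l)})^T \delta^{(l+1)} \odot \sigma'(z^{(l)}),$$

and since $\sigma'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})$ for the sigmoid, this is exactly the formula in the question with $l = 3$. Multiplying by $(\Theta^{(l)})^T$ is what attributes error to particular nodes: for each node in layer $l$, it sums that node's contributions over all the output errors it feeds into, which is just the chain rule summed over paths.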
Because I am not good with MathJax, I am posting two photos. I believe that other answer on Cross Validated is a good one (except for the notation: where it says theta for layer $L+1$, I substituted theta for layer $L$). There is an alternate answer that comments on the notation, and I think that person is correct; it probably depends on whether you view the theta between, say, layers 3 and 4 as belonging to layer 3 or to layer 4.
I assumed the calculations could be vectorized, so my notes are much cleaner, without many of the subscripts in the answer online; I am basically just showing the chain rule from layer to layer rather than to specific nodes.
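To convince myself, I also wrote a minimal numerical check. This is a hypothetical tiny network (biases omitted for brevity, sigmoid on every layer, cross-entropy cost — with that cost the sigmoid derivative cancels at the output, which is why $\delta^{(L)} = a^{(L)} - y$ has no sigmoid term, answering my own question above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical tiny network: 4 -> 5 -> 3 -> 2 units, sigmoid everywhere.
rng = np.random.default_rng(1)
sizes = [4, 5, 3, 2]
thetas = [rng.standard_normal((m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((4, 1))
y = np.array([[1.0], [0.0]])

def forward(thetas):
    a = [x]
    for th in thetas:
        a.append(sigmoid(th @ a[-1]))
    return a

def cost(thetas):
    aL = forward(thetas)[-1]
    # Cross-entropy cost; its derivative w.r.t. z_L is a_L - y.
    return -np.sum(y * np.log(aL) + (1 - y) * np.log(1 - aL))

# Backprop: delta at the output is a_L - y; each earlier delta is
# theta^T . delta_next .* a .* (1 - a), exactly the formula in question.
a = forward(thetas)
deltas = [a[-1] - y]
for th, al in zip(reversed(thetas[1:]), reversed(a[1:-1])):
    deltas.append((th.T @ deltas[-1]) * al * (1 - al))
deltas.reverse()
grads = [d @ al.T for d, al in zip(deltas, a[:-1])]

# Finite-difference check on one weight of the first theta.
eps = 1e-6
thetas[0][0, 0] += eps; cp = cost(thetas)
thetas[0][0, 0] -= 2 * eps; cm = cost(thetas)
thetas[0][0, 0] += eps
numeric = (cp - cm) / (2 * eps)
print(abs(numeric - grads[0][0, 0]))  # should be tiny
```

The finite-difference gradient agrees with the backprop gradient, which is what finally convinced me the "multiply by theta transpose" step really is the chain rule in vectorized form.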