I'm working through the math used in artificial neural networks, and I've gotten stuck on calculating the error-function derivatives for hidden layers during backpropagation.
On page 244 of Bishop's "Pattern Recognition and Machine Learning", formula 5.55 gives the derivative of the error function with respect to a hidden unit's pre-activation as a sum over all units $k$ to which unit $j$ sends connections:
$$ \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$
I know the chain rule: if $a_j$ fed into only one other node, I could apply it directly to separate the factors. But what is the intuition behind summing these terms over all nodes when the output of unit $j$ feeds into multiple nodes?
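To make the sum concrete, here is a small numeric check I put together (the two-output toy network, weights, and targets below are my own invention, not from the book): one hidden unit $j$ feeds two output units, and the backprop formula's sum matches a finite-difference derivative of $E$ with respect to $a_j$.

```python
import math

# Toy setup (my own, not from Bishop): one hidden unit j whose
# pre-activation a_j feeds TWO output units k = 1, 2.
#   z_j = tanh(a_j)       hidden activation
#   a_k = w_k * z_j       output pre-activations (linear outputs, y_k = a_k)
#   E   = 0.5 * sum_k (y_k - t_k)^2
w = [0.7, -1.3]   # weights from hidden unit j to the two output units
t = [0.5, 0.2]    # targets
a_j = 0.4         # pre-activation of the hidden unit

def error(a_j):
    z_j = math.tanh(a_j)
    return 0.5 * sum((w_k * z_j - t_k) ** 2 for w_k, t_k in zip(w, t))

# Formula (5.55): dE/da_j = sum_k (dE/da_k) * (da_k/da_j)
z_j = math.tanh(a_j)
dE_da_k = [w_k * z_j - t_k for w_k, t_k in zip(w, t)]   # dE/da_k (linear outputs)
da_k_da_j = [w_k * (1 - z_j ** 2) for w_k in w]         # da_k/da_j = w_k * tanh'(a_j)
backprop = sum(d * g for d, g in zip(dE_da_k, da_k_da_j))

# Central finite difference for comparison
eps = 1e-6
numeric = (error(a_j + eps) - error(a_j - eps)) / (2 * eps)

print(backprop, numeric)  # these agree to numerical precision
```

The reason both terms of the sum are needed: perturbing $a_j$ changes $z_j$, which changes *both* $a_1$ and $a_2$, and each of those changes contributes to the change in $E$. Dropping either term makes the result disagree with the numerical derivative.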
Thanks