Deriving the error in activation nodes in the backpropagation algorithm


I am trying to understand the backpropagation algorithm from Andrew Ng's Machine Learning course. Here is a picture of the slide on which I am stuck.

I know that the first-order change (error) in a function $f(x)$ is approximated as:

$f(x+\Delta x) - f(x) \approx f'(x)\,\Delta x$
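As a quick sanity check of this approximation (my own example with a made-up function $f(x)=x^2$, not from the slides):

```python
# Numeric check of the first-order approximation
#   f(x + dx) - f(x) ~ f'(x) * dx   for a small step dx.
def f(x):
    return x ** 2

def f_prime(x):
    # Exact derivative of f(x) = x**2
    return 2 * x

x, dx = 3.0, 1e-4
actual = f(x + dx) - f(x)      # true change in f
approx = f_prime(x) * dx       # first-order estimate
print(actual, approx)          # the two agree to first order in dx
```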

But I am unable to understand how $\delta^{(3)}$ is calculated as:

$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \,.*\, g'(z^{(3)})$
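To make sure I at least understand the mechanics of this formula, here is how I would evaluate it numerically (the shapes and weight values are made up by me, assuming a sigmoid activation $g$; the course formula may also include bias terms, which I have omitted):

```python
import numpy as np

def g(z):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    # Derivative of the sigmoid: g(z) * (1 - g(z))
    s = g(z)
    return s * (1.0 - s)

Theta3 = np.array([[0.5, -0.3]])   # hypothetical weights, layer 3 -> layer 4
delta4 = np.array([0.2])           # hypothetical error at the output layer
z3 = np.array([0.1, -0.4])         # hypothetical pre-activation values in layer 3

# Matrix product propagates the error backwards through the weights;
# the elementwise (.*) product scales it by the activation's local slope.
delta3 = (Theta3.T @ delta4) * g_prime(z3)
print(delta3)
```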

where $\delta^{(3)}_j$ is error in node j of layer 3.

EDIT 1: After searching on Coursera's discussion forum I found some help.

According to the week 5 lecture notes (under the heading Backpropagation Intuition):

Intuitively, $\delta_j^{(l)}$ is the "error" for $a^{(l)}_j$ (unit j in layer l)

and

More formally, the delta values are actually the derivative of the cost function

$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$

Using the above information, when I tried to compute $\dfrac{\partial}{\partial z_j^{(l)}} cost(t)$, a doubt arose.

If $\delta_j^{(l)}$ is the error in $a^{(l)}_j$, then why is $\delta_j^{(l)}$ defined as the partial derivative of the cost function with respect to $z^{(3)}_j$, where $z^{(3)}_j=\sum_k\Theta^{(2)}_{jk}a^{(2)}_k$, and not with respect to $\Theta^{(2)}_j$ or $a^{(2)}_j$?
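Part of what I have pieced together myself (so this may be wrong): defining $\delta$ with respect to $z$ looks like a matter of convenience, because the derivatives with respect to both $\Theta$ and $a$ factor through $z$ by the chain rule:

$$\frac{\partial J}{\partial \Theta^{(2)}_{jk}} = \frac{\partial J}{\partial z^{(3)}_j}\,\frac{\partial z^{(3)}_j}{\partial \Theta^{(2)}_{jk}} = \delta^{(3)}_j\, a^{(2)}_k, \qquad \frac{\partial J}{\partial a^{(2)}_k} = \sum_j \frac{\partial J}{\partial z^{(3)}_j}\,\frac{\partial z^{(3)}_j}{\partial a^{(2)}_k} = \sum_j \delta^{(3)}_j\, \Theta^{(2)}_{jk}$$

so computing $\delta$ once gives both of the other derivatives almost for free. Is this the right intuition?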

EDIT 2:

On further searching I found the derivation of the errors in the week 5 lecture notes, and now I am stuck at a new place. If $J(\Theta)$ is defined as $J(\Theta) = -y\log(h_\Theta(x)) - (1-y)\log(1-h_\Theta(x))$, then how is

$\dfrac{\partial J(\Theta)}{\partial\Theta^{(L-1)}}=\dfrac{\partial J(\Theta)}{\partial a^{(L)}}\dfrac{\partial a^{(L)}}{\partial z^{(L)}}\dfrac{\partial z^{(L)}}{\partial\Theta^{(L-1)}}$

I know that the chain rule is used in the above equation, but I cannot figure out how each factor is evaluated.
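Here is my attempt at the individual factors (assuming a sigmoid activation, so $h_\Theta(x) = a^{(L)} = g(z^{(L)})$ with $g'(z) = g(z)(1-g(z))$); I am not sure this is right:

$$\frac{\partial J}{\partial a^{(L)}} = -\frac{y}{a^{(L)}} + \frac{1-y}{1-a^{(L)}}, \qquad \frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}\bigl(1-a^{(L)}\bigr), \qquad \frac{\partial z^{(L)}}{\partial \Theta^{(L-1)}} = a^{(L-1)}$$

Multiplying the first two factors gives $a^{(L)} - y$, which would match the output-layer error $\delta^{(L)} = a^{(L)} - y$ from the lecture notes. Could someone confirm whether this is how the chain rule is meant to be applied here?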