I am trying to understand the backpropagation algorithm from Andrew Ng's Machine Learning course. Here is a picture of the slide on which I am stuck.
I know that the error in a function $f(x)$ is approximated to first order as:
$f(x+\Delta x) - f(x) \approx f'(x)\,\Delta x$
But I am unable to understand how $\delta^{(3)}$ is calculated as:
$\delta^{(3)} = (\Theta^{(3)})^T * \delta^{(4)} .* g'(z^{(3)})$
where $\delta^{(3)}_j$ is the error in node $j$ of layer 3.
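To make the shapes concrete, here is a minimal NumPy sketch of that backward step. All sizes and values are made up for illustration, and $g$ is assumed to be the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # g'(z) for the sigmoid: g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical example: layer 3 has 4 units, layer 4 has 2 units.
rng = np.random.default_rng(0)
Theta3 = rng.standard_normal((2, 4))   # Theta^{(3)}: maps layer-3 activations to layer-4 inputs
z3 = rng.standard_normal(4)            # z^{(3)}: pre-activation values of layer 3
delta4 = rng.standard_normal(2)        # delta^{(4)}: error of the output layer

# delta^{(3)} = (Theta^{(3)})^T delta^{(4)} .* g'(z^{(3)})
delta3 = (Theta3.T @ delta4) * sigmoid_prime(z3)
print(delta3.shape)  # (4,)
```

The transpose propagates the output error backwards through the weights, and the element-wise product with $g'(z^{(3)})$ scales it by how sensitive each unit's activation is to its input.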
EDIT 1: After searching on Coursera's discussion forum I found some help.
According to the week 5 lecture notes (under the heading Backpropagation Intuition):
Intuitively, $\delta_j^{(l)}$ is the "error" for $a^{(l)}_j$ (unit j in layer l)
and
More formally, the delta values are actually the derivative of the cost function
$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$
Using the above information, when I tried to compute $\dfrac{\partial}{\partial z_j^{(l)}} cost(t)$, I ran into a doubt.
If $\delta_j^{(l)}$ is the error in $a^{(l)}_j$, then why is $\delta_j^{(l)}$ defined as the partial derivative of the cost function with respect to $z^{(3)}_j$, where $z^{(3)}_j=\Theta^{(2)}_j a^{(2)}$ (row $j$ of $\Theta^{(2)}$ times the activation vector), and not with respect to $\Theta^{(l)}_j$ or $a^{(2)}_j$?
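I did at least manage to verify the definition numerically for a single sigmoid output unit with the cross-entropy cost. The analytic value $a - y$ used below is the standard output-layer delta for that pairing; everything else is a made-up check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(z, y):
    # Cross-entropy cost of a single sigmoid output unit
    a = sigmoid(z)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

z, y = 0.7, 1.0
eps = 1e-6

# Central finite difference: d cost / d z
numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)

# Analytic delta for sigmoid + cross-entropy: delta = a - y
analytic = sigmoid(z) - y

print(abs(numeric - analytic) < 1e-6)  # True
```

So the derivative with respect to $z$ really does match the "error" $a - y$, which is reassuring, but it does not resolve my question about why $z$ is the variable chosen.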
EDIT 2:
On further searching I found the derivation of the errors in the week 5 lecture notes, and now I am stuck at a new place. If $J(\Theta)$ is defined as $J(\Theta)=-y\log\bigl(h_\Theta(x)\bigr)-(1-y)\log\bigl(1-h_\Theta(x)\bigr)$, then how is
$\dfrac{\partial J(\Theta)}{\partial\Theta^{(L-1)}}=\dfrac{\partial J(\Theta)}{\partial a^{(L)}}\dfrac{\partial a^{(L)}}{\partial z^{(L)}}\dfrac{\partial z^{(L)}}{\partial\Theta^{(L-1)}}$
I know that the chain rule is used in the above equation, but I cannot figure out how to apply it here.
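Here is how far I can get with the three individual factors, assuming $a^{(L)}=g(z^{(L)})$ with sigmoid $g$ and $z^{(L)}=\Theta^{(L-1)}a^{(L-1)}$:

$$\begin{aligned}
\frac{\partial J(\Theta)}{\partial a^{(L)}} &= -\frac{y}{a^{(L)}}+\frac{1-y}{1-a^{(L)}},\\[4pt]
\frac{\partial a^{(L)}}{\partial z^{(L)}} &= g'(z^{(L)}) = a^{(L)}\bigl(1-a^{(L)}\bigr),\\[4pt]
\frac{\partial z^{(L)}}{\partial \Theta^{(L-1)}} &= a^{(L-1)}.
\end{aligned}$$

Multiplying the first two factors does simplify to $a^{(L)}-y$, which matches the $\delta^{(L)}$ from the lecture notes, so my remaining confusion is only about why the derivative splits into exactly these three factors.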