I was going through the backpropagation equations in Andrew Ng's Deep Learning course, and I got this set of equations for a two-layer neural network:
$dZ^{[2]} = A^{[2]} - y$
$dW^{[2]} = 1 / m \space\space dZ^{[2]}\space A^{[1]T}$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]'}(Z^{[1]})$ (where $*$ denotes the element-wise product)
$dW^{[1]} = 1/m \space\space dZ^{[1]} \space X^{T}$
Where
$A^{[i]}$ is the matrix of activations of the $i^{th}$ layer.
$y$ is the target value.
$Z^{[i]}$ is the pre-activation input to the $i^{th}$ layer, i.e. $Z^{[i]} = W^{[i]} A^{[i-1]}$.
$W^{[i]}$ is the weight matrix connecting the $(i-1)^{th}$ layer to the $i^{th}$ layer.
$g^{[i]}()$ is the activation function for the $i^{th}$ layer.
$X$ is the input to the neural network.
I've intentionally ignored the bias terms to make it simpler.
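To check that I copied the equations down correctly, I verified them numerically with a small NumPy sketch (biases still omitted; the hidden activation $g^{[1]}$ is assumed to be tanh here, and the output activation sigmoid, as in the course lectures — the dimensions and names below are made up for the test):

```python
import numpy as np

# Minimal sketch of the listed equations, bias terms omitted.
# Assumed: hidden activation g^[1] = tanh, output = sigmoid with
# cross-entropy cost, so dZ2 = A2 - y.
rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                      # input dim, hidden units, examples
X  = rng.standard_normal((n_x, m))
y  = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.standard_normal((n_h, n_x)) * 0.1
W2 = rng.standard_normal((1, n_h)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2):
    Z1 = W1 @ X
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

def cost(W1, W2):
    # E = -(1/m) sum[ y log A2 + (1-y) log(1-A2) ]
    A2 = forward(W1, W2)[3]
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

Z1, A1, Z2, A2 = forward(W1, W2)
dZ2 = A2 - y                               # first equation
dW2 = (1 / m) * dZ2 @ A1.T                 # second equation
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)         # third equation; tanh' = 1 - tanh^2
dW1 = (1 / m) * dZ1 @ X.T                  # fourth equation

# Finite-difference check of dW1 against the analytic gradient.
eps, num = 1e-6, np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (cost(Wp, W2) - cost(Wm, W2)) / (2 * eps)
print(np.allclose(dW1, num, atol=1e-6))    # the two gradients should agree
```

The numerical gradient matches the analytic $dW^{[1]}$, so the equations as written above do seem to be correct.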
I understand that the first equation represents the error at the output layer, and that the second equation is derived from $\partial E / \partial W^{[2]}$, where the cross-entropy cost is $E = -\frac{1}{m} \sum \left[ y \log A^{[2]} + (1 - y) \log(1 - A^{[2]}) \right]$ and the output activation is the sigmoid function.
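Concretely, the chain-rule computation I used for the second equation (keeping the $1/m$ factor in $dW^{[2]}$, as the course notation does) was:

$$
\frac{\partial E}{\partial Z^{[2]}}
= \frac{\partial E}{\partial A^{[2]}} \cdot \frac{\partial A^{[2]}}{\partial Z^{[2]}}
= \frac{1}{m}\,\frac{A^{[2]} - y}{A^{[2]}\left(1 - A^{[2]}\right)} \cdot A^{[2]}\left(1 - A^{[2]}\right)
= \frac{1}{m}\left(A^{[2]} - y\right),
$$

and, since $Z^{[2]} = W^{[2]} A^{[1]}$,

$$
\frac{\partial E}{\partial W^{[2]}}
= \frac{\partial E}{\partial Z^{[2]}}\, A^{[1]T}
= \frac{1}{m}\, dZ^{[2]}\, A^{[1]T}.
$$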
I would like to see a formal derivation of the third equation, the one for $dZ^{[1]}$.