Derivation of partial derivative of cost function with respect to weights in backpropagation algorithm


I am studying machine learning from Andrew Ng's Machine Learning course on Coursera. I am stuck on understanding the math behind backpropagation.

Here is an image of the backpropagation algorithm from his course. I am able to follow the derivation of the $\delta$ terms from his course notes, but the derivation of $\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^{T}$ is not given.
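In case the image does not render here, this is the update as I would implement it for a single training example (a NumPy sketch; the layer sizes, the `sigmoid` activation, and the weight shapes are my own assumptions, not taken from the course slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Made-up sizes: 3 inputs, 4 hidden units, 2 outputs.
Theta1 = rng.standard_normal((4, 3 + 1))   # maps a1 (with bias) -> z2
Theta2 = rng.standard_normal((2, 4 + 1))   # maps a2 (with bias) -> z3

# Accumulators, one per weight matrix.
Delta1 = np.zeros_like(Theta1)
Delta2 = np.zeros_like(Theta2)

x = rng.standard_normal(3)
y = np.array([1.0, 0.0])

# Forward pass (prepending the bias unit 1 to each activation).
a1 = np.concatenate(([1.0], x))
z2 = Theta1 @ a1
a2 = np.concatenate(([1.0], sigmoid(z2)))
z3 = Theta2 @ a2
a3 = sigmoid(z3)

# Backward pass: the delta terms from the course notes.
delta3 = a3 - y
delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))

# The step I am asking about: Delta^{(l)} += delta^{(l+1)} (a^{(l)})^T
Delta2 += np.outer(delta3, a2)
Delta1 += np.outer(delta2, a1)

print(Delta1.shape, Delta2.shape)   # (4, 4) (2, 5)
```

The shapes line up with the corresponding $\Theta^{(l)}$ matrices, which is why I believe the outer product is the right reading of $\delta^{(l+1)}(a^{(l)})^{T}$.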

My questions:

1. What is the meaning of $\Delta^{(l)}$, and how is $\Delta^{(l)}=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^{T}$ derived?

2. What is the meaning of $\require{enclose}\enclose{horizontalstrike}{D_{i,j}^{(l)}}$, and why is it $\enclose{horizontalstrike}{D_{i,j}^{(l)}:=\dfrac{1}{m}(\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)})}$ if $\enclose{horizontalstrike}{j\ne0}$ and $\enclose{horizontalstrike}{D_{i,j}^{(l)}:=\dfrac{1}{m}\Delta_{i,j}^{(l)}}$ if $\enclose{horizontalstrike}{j=0}$?

3. Why is $\require{enclose}\enclose{horizontalstrike}{D_{i,j}^{(l)}=\dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}}$?

EDIT: After referring to this answer on stats.stackexchange.com, I now understand that $D_{i,j}^{(l)}$ is the gradient of the cost with respect to the weight $\Theta_{i,j}^{(l)}$, averaged over the whole training set. So the only part I still do not understand is: why do we add $\delta^{(l+1)}(a^{(l)})^{T}$ to $\Delta^{(l)}$?
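To make the remaining question concrete: numerically, accumulating $\delta^{(l+1)}(a^{(l)})^{T}$ over the training set and then dividing by $m$ is exactly the sample mean of the per-example outer products (a NumPy sketch; the random arrays are stand-ins for the $\delta^{(l+1)}$ and $a^{(l)}$ vectors, and the sizes are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10                                  # number of training examples
deltas = rng.standard_normal((m, 2))    # stand-ins for delta^{(l+1)}, one row per example
acts = rng.standard_normal((m, 5))      # stand-ins for a^{(l)}, one row per example

# Accumulation as in the algorithm: Delta += delta (a)^T for each example.
Delta = np.zeros((2, 5))
for i in range(m):
    Delta += np.outer(deltas[i], acts[i])

# Dividing by m gives the average of the per-example outer products.
D_unreg = Delta / m
mean_of_grads = np.mean([np.outer(deltas[i], acts[i]) for i in range(m)], axis=0)
print(np.allclose(D_unreg, mean_of_grads))   # True
```

So the accumulation appears to be just a running sum of per-example gradient contributions; what I am asking for is the derivation that each contribution equals $\delta^{(l+1)}(a^{(l)})^{T}$.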

Please give answers with mathematical derivations.