I'm currently trying to teach myself how to code a neural network, and I want as good an understanding of the math as is possible for someone who got a D- in calculus over 15 years ago and has not taken a math class since. Suffice it to say I am quite a math beginner, but I have been reading and watching videos for weeks now. I have an OK understanding of much of what goes into the algorithms, but - largely because I struggle with the chain rule - I am absolutely stumped by the backpropagation algorithm, specifically how to derive the error term.
All of the equations I'm using can be found here:
This is what I know:
This is the error equation representing the sum of squared error: $E = \frac{1}{2} \sum\limits_{k} (t_k - a_k)^2$
In a neural network, basically, you have an output that is associated with some weights. We want to change the weights to reduce the error: $\Delta W \propto -\frac{\partial E}{\partial W}$
Ok, fine so far. $E$ is not directly a function of the weights, so I get that we have to use the chain rule here.
'E' is a function of $a_k$, which is itself a function of the net input ($net_k$), which is ITSELF a function of the weights. So according to the link, the equation for backpropagation is ultimately:
$\frac{\partial E}{\partial w_{jk}} = {\frac{\partial E}{\partial a_k}}\times {\frac{\partial a_k}{\partial net_k}}\times {\frac{\partial net_k}{\partial w_{jk}}}$
Ok, so I get ${\frac{\partial E}{\partial a_k}}$ - you use the power rule: the 2 cancels the $\frac{1}{2}$, and the summation also drops out (only the $k$-th term involves $a_k$), so the derivative here is just $-(t_k - a_k)$. I get this so far.
So $a_k$ is the activation function which for a sigmoid is ${\frac{1}{1+e^{-z}}}$ and this is where I lose it. So the sigmoid derivative is $a_k(1-a_k)$ but I have no idea why. The link presents the sigmoid equation as ${(1+e^{-net_k})^{-1}}$ which leads to ${\frac{\partial a_k}{\partial net_k}} = {\frac{e^{-net_k}}{(1+e^{-net_k})^{2}}}$ but I don't understand this derivation. Could someone walk me through it, possibly by explaining what rules you're using to get here?
Finally, when performing ${\frac{\partial net_k}{\partial w_{jk}}}$ the article states:
Note that only one term of the net summation will have a non-zero derivative: again the one associated with the particular weight we are considering.
This means that certain weights are going to have derivatives of 0, so we only need to consider...I think...the weight connecting 'j' to 'k'? Clearly I'm not sure here. Then they state: ${\frac{\partial net_k}{\partial w_{jk}}} = {\frac{\partial(w_{jk}a_j)}{\partial{w_{jk}}}} = a_j$. Conceptually, I understand the idea behind partial derivatives is to take whatever we're not differentiating with respect to and treat it as a constant, so in this example we are treating $a_j$ as a constant. Why then is the derivative $a_j$? Shouldn't it come out as 0? Additionally, where did the w's go? What kind of rules are being used in the derivation of these variables? I'm not sure if one uses the power rule/quotient rule/etc. (obviously not, but just as examples) to get this result.
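For what it's worth, I did try poking at this numerically with a tiny NumPy script (all values made up by me, not from the article), and the derivative of $net_k$ with respect to each weight really does come out to the matching $a_j$:

```python
import numpy as np

# Made-up activations a_j feeding unit k, and the weights w_jk into it.
a_j = np.array([0.2, 0.7, 0.5])
w_jk = np.array([0.1, -0.3, 0.8])

def net_k(w):
    # net_k = sum over j of w_jk * a_j
    return np.dot(w, a_j)

# Nudge each weight separately and watch how net_k responds.
eps = 1e-6
numeric_grads = []
for j in range(len(w_jk)):
    w_plus = w_jk.copy()
    w_plus[j] += eps
    numeric_grads.append((net_k(w_plus) - net_k(w_jk)) / eps)

print(numeric_grads)  # each entry matches the corresponding a_j
```

So empirically each weight's derivative is its own $a_j$, and nudging one weight leaves the other terms of the sum unchanged - I just don't understand *why* from the rules.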
And then to get the actual weight change you just multiply all these results right?
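To check that multiplying the pieces is really all there is to it, I also wrote this little script (again, made-up numbers, single weight) comparing the product of the three partials against a finite-difference estimate of $E$:

```python
import math

# Made-up numbers: one output unit k, one input activation a_j.
a_j, w_jk, t_k = 0.6, 0.4, 1.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w):
    a_k = sigmoid(w * a_j)          # forward pass: net_k -> a_k
    return 0.5 * (t_k - a_k) ** 2   # E = 1/2 (t_k - a_k)^2

# The three chain-rule factors, multiplied together:
a_k = sigmoid(w_jk * a_j)
dE_da   = -(t_k - a_k)          # dE/da_k
da_dnet = a_k * (1 - a_k)       # da_k/dnet_k (the sigmoid derivative)
dnet_dw = a_j                   # dnet_k/dw_jk
analytic = dE_da * da_dnet * dnet_dw

# Finite-difference estimate of dE/dw_jk for comparison:
eps = 1e-6
numeric = (error(w_jk + eps) - error(w_jk - eps)) / (2 * eps)
print(analytic, numeric)  # the two agree closely
```

The two numbers match, so I believe the multiplication part - it's the individual derivatives I can't derive myself.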
I am clearly an absolute math idiot. If someone could come down to my level and help me out here I would appreciate it beyond words.
The derivative of that sigmoid is a good example on which to practice the chain rule.
$$\cases{f(x) = \frac 1 {1+e^{-x}}\\ f(g) = \frac 1 g\\ g(h) = 1+e^{h}\\ h(x) = -x}$$
$$\frac {\partial{f}}{\partial x} = \frac{\partial f}{\partial g}\times\frac{\partial g}{\partial h}\times \frac{\partial h}{\partial x}$$
The first derivative is $-1/g^2$ (power rule, since $1/g = g^{-1}$), the second is $e^{h}$ (the exponential is its own derivative), and the third is $-1$, so substituting and simplifying
$$-\frac {-e^{-x}} {(1+e^{-x})^2} =\frac {e^{-x}} {(1+e^{-x})^2} $$
This matches the article's expression once we set $x = net_k$. And to get the compact form you asked about, split the fraction:
$$\frac {e^{-x}} {(1+e^{-x})^2} = \frac 1 {1+e^{-x}}\cdot\frac {e^{-x}} {1+e^{-x}} = f(x)\cdot\frac {(1+e^{-x})-1} {1+e^{-x}} = f(x)\,(1-f(x)),$$
which is exactly $a_k(1-a_k)$ with $a_k = f(net_k)$.
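If you want to convince yourself the two forms really are the same function, here's a quick numerical check (arbitrary test point, just for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.37  # arbitrary test point

# Form produced by the chain-rule walkthrough above:
chain_form = math.exp(-x) / (1.0 + math.exp(-x)) ** 2
# Compact form from the article:
compact_form = sigmoid(x) * (1.0 - sigmoid(x))
# Finite-difference slope of the sigmoid itself:
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(chain_form, compact_form, numeric)  # all three agree
```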