How to correctly backpropagate in neural networks?


I am trying to implement a neural network without using any external libraries, in order to understand it better. Before implementing, I made some calculations, but I am not $100\%$ certain they are correct.

I am using the following notation:

$b_i^{(L)}$ = bias in i-th neuron in L-th layer

$w_{xy}^{(L)}$ = weight in the L-th layer coming from the x-th neuron in the previous layer to the y-th neuron in the current layer

$e_{xy}$ = expected output from last layer at y-th neuron from x-th training data

$a_{xy}^{(L)}$ = result of x-th training data at L-th layer in y-th neuron

$z_{xy}^{(L)}$ = result of x-th training data at L-th layer in y-th neuron before the activation function. I am using the sigmoid $S(x)$, and therefore $a_{xy}^{(L)} = S(z_{xy}^{(L)})$ and $z_{xy}^{(L)} = b_y^{(L)} + \sum\limits_{i=1}^{|L-1|}w_{iy}^{(L)}a_{xi}^{(L-1)}$

$C_i$ = cost function of i-th training data ($=\sum\limits_{j=1}^{|L_{last}|}(a_{ij}^{(L_{last})} - e_{ij})^2$ )

$C$ = total cost, the average over all $T$ training data in the current mini-batch ($C = \frac1T\sum\limits_{i=1}^{T}C_i$ )

$|L_i|$ = count of neurons in i-th layer
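The forward pass these definitions describe can be sketched in plain Python (no external libraries, matching the constraint in the question); `forward_layer` and its argument layout are my own hypothetical names, not something from the question:

```python
import math

def sigmoid(z):
    # S(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(a_prev, weights, biases):
    """Compute z and a for one layer.

    weights[x][y] is w_xy: the weight from neuron x in the previous
    layer to neuron y in this layer; biases[y] is b_y.
    """
    z = [biases[y] + sum(weights[x][y] * a_prev[x]
                         for x in range(len(a_prev)))
         for y in range(len(biases))]
    a = [sigmoid(zy) for zy in z]
    return z, a
```

With all weights and biases zero, every $z$ is $0$ and every activation is $S(0) = 0.5$, which is a quick sanity check for an implementation.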

These are my calculations I am trying to get gradient of for backpropagation:

$\frac{\partial{C_i}}{\partial{b_x^{(L_{last})}}} = (2a_{ix}^{(L_{last})}-2e_{ix}) * S'(z_{ix}^{(L_{last})})$

$\frac{\partial{C_i}}{\partial{w_{xy}^{(L_{last})}}} = (2a_{iy}^{(L_{last})}-2e_{iy}) * S'(z_{iy}^{(L_{last})}) * a_{ix}^{(L_{last}-1)}$

$\frac{\partial{C_i}}{\partial{a_{ix}^{(L_{last}-1)}}} = \sum\limits_{j=1}^{|L_{last}|}[(2a_{ij}^{(L_{last})}-2e_{ij}) * S'(z_{ij}^{(L_{last})}) * w_{xj}^{(L_{last})}]$


$\frac{\partial{C_i}}{\partial{b_x^{(L_{last}-1)}}} = \frac{\partial{C_i}}{\partial{a_{ix}^{(L_{last}-1)}}} * S'(z_{ix}^{(L_{last}-1)})$

$\frac{\partial{C_i}}{\partial{w_{xy}^{(L_{last}-1)}}} = \frac{\partial{C_i}}{\partial{a_{iy}^{(L_{last}-1)}}}* S'(z_{iy}^{(L_{last}-1)}) * a_{ix}^{(L_{last}-2)}$

$\frac{\partial{C_i}}{\partial{a_{ix}^{(L_{last}-2)}}} = \sum\limits_{j=1}^{|L_{last}-1|}[\frac{\partial{C_i}}{\partial{a_{ij}^{(L_{last}-1)}}} * S'(z_{ij}^{(L_{last}-1)}) * w_{xj}^{(L_{last}-1)}]$


and so on for the earlier layers...
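The repeating pattern in the derivatives above (multiply the incoming $\frac{\partial C_i}{\partial a}$ by $S'(z)$, then distribute to biases, weights, and the previous layer's activations) can be sketched as a single per-layer function. This is my own illustrative sketch, with hypothetical names, assuming the same list-based layout as before:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # S'(z) = S(z) * (1 - S(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_layer(dC_da, z, a_prev, weights):
    """Given dC/da for this layer's activations, return the gradients
    for this layer's biases and weights, and dC/da for the previous
    layer (to be fed back into this same function one layer down).

    For the output layer, dC_da[y] = 2 * (a[y] - e[y]).
    """
    delta = [dC_da[y] * sigmoid_prime(z[y]) for y in range(len(z))]
    grad_b = delta                                   # dC/db_y
    grad_w = [[a_prev[x] * delta[y] for y in range(len(z))]
              for x in range(len(a_prev))]           # dC/dw_xy
    dC_da_prev = [sum(weights[x][y] * delta[y] for y in range(len(z)))
                  for x in range(len(a_prev))]       # summed over j
    return grad_b, grad_w, dC_da_prev
```

The sum in `dC_da_prev` is exactly the $\sum_j$ in the equations above: each previous-layer activation feeds every neuron in the current layer, so its gradient contributions are summed.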

Q1: Are the calculations above correct?

Then I have all current values (weights and biases) collected, let's say, in some value vector $v$. Also, for each training example I have a gradient vector $g_i$ corresponding to $v$, computed with the calculations above. I suppose I average all $g_i$ to create the final gradient vector $g$: $g = \frac1T\sum\limits_{i=1}^{T}g_i$ (mini-batch gradient descent).

To my understanding, what I have basically got is the direction of steepest ascent of the cost function $C$ at the point $v$, so if I want to create an altered value vector $v'$, I should use the equation $v' = v - g$ to move $v$ in the opposite direction and thereby reduce the cost.
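Treating the parameters and gradients as flat lists, the averaging and update steps described above are only a few lines; `average_gradients` and `update` are hypothetical names for this sketch:

```python
def average_gradients(grads):
    """g = (1/T) * sum of per-example gradient vectors g_i."""
    T = len(grads)
    return [sum(g[k] for g in grads) / T for k in range(len(grads[0]))]

def update(v, g, d=1.0):
    """v' = v - d * g; with d = 1 this is the plain update v - g."""
    return [vk - d * gk for vk, gk in zip(v, g)]
```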

Q2: Is the gradient vector correctly used?

Also, I was wondering, Q3: does it make sense to alter my gradient a bit, for example to speed up the learning process, say $v' = v - d \cdot g$ where $d \geq 1$, or generally to use some modifying function $m$, such as $v' = v - m(g)$? Or is the raw gradient usually the best pick?

Best answer:

Q1: Are the calculations above correct?

Your calculations seem correct. I don't have time to verify every step meticulously against your notation, but you appear to apply the chain rule correctly, and when a parameter receives multiple gradient contributions you sum them, which is the correct way to apply backpropagation.

Q2: Is the gradient vector correctly used?

As for how you're applying the gradient, you are partly correct. You do indeed subtract $g$ from $v$ to get $v'$, as you have done; this will decrease the cost function and make your model learn. However, averaging the gradient is unnecessary: the average just scales the summed gradient by a constant factor, and scaling the gradient by a factor is exactly what a learning rate does. It is simpler to have a single hyperparameter controlling how much the gradient is scaled than to have both that hyperparameter and the averaging.

Q3: does it make sense to alter my gradient a bit?

Yes, it definitely makes sense to scale your gradient, but rarely by a factor greater than 1. A large factor will typically not make the model learn faster, as you might hope; instead the updates overshoot the minimum, and the error can diverge. As a general rule of thumb, a good learning rate (in your equations, a good value for $d$) is something like 0.001. Of course, you'll need to experiment with it, but a value in that range should ensure stable learning (it may be surprising that the value is so low, but a low learning rate ensures your model rarely overshoots local minima).
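The overshooting effect is easy to see on a toy one-parameter problem rather than the full network. The sketch below (my own illustration, not from the answer) runs gradient descent on $f(v) = v^2$, whose gradient is $2v$: a small learning rate shrinks $v$ toward the minimum at $0$, while a rate above $1$ makes each step flip sign and grow, so the iterates diverge.

```python
def descend(lr, steps=100, v=1.0):
    """Gradient descent on f(v) = v^2 with a fixed learning rate.

    Each step: v' = v - lr * f'(v) = v - lr * 2v = (1 - 2*lr) * v,
    so |1 - 2*lr| < 1 converges and |1 - 2*lr| > 1 diverges.
    """
    for _ in range(steps):
        v -= lr * 2 * v
    return v
```

For example, `descend(0.1)` shrinks toward $0$, while `descend(1.1)` blows up, which is the divergence described above.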