Gradients "backward flow" calculation rules


I am reading the article Hacker's guide to Neural Networks, specifically the section "Becoming a Backprop Ninja".

The quote:

lets just use variables such as a,b,c,x, and refer to their gradients as da,db,dc,dx respectively. Again, we think of the variables as the “forward flow” and their gradients as “backward flow” along every wire. Our first example was the * gate:

x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
da = b * dx;

I have rewatched calculus lectures on derivatives and gradients but still don't understand this simple part.

My reasoning:

The partial derivative of x with respect to a is equal to b.

dx/da = b (1)

During gradient-based optimisation we have dx, that is, the gradient of x (a vector that points toward the maximum and is the sum of the partial-derivative vectors). This gradient tells us how much we should adjust the independent variables (that is b here) to get to the maximum.

From (1):

da = dx/b

So here we need to take the gradient of x that came from the cost function and divide it by b. Then we can adjust b in that direction.

So why does the author write da = b * dx; and not da = dx/b here? Also, if you see any mistakes in my reasoning, please tell me.

Best answer

See reverse mode of automatic differentiation for a systematic view.

Think of $x$ as the input of another function $f$, so that $y=f(x)=f(a\cdot b)$, and you want to compute the derivatives of the combined expression with respect to the inputs (also called sensitivities). Then by the chain rule $$ \frac{\partial}{\partial a}f(ab)=\frac{\partial}{\partial x}f(ab)\cdot\frac{\partial}{\partial a}(ab)=f'(x)\cdot b $$ That is, the sensitivity $\bar a=\frac{\partial f}{\partial a}$ is the product of the sensitivity of $x$, $\bar x=\frac{\partial f}{\partial x}$, and $b$: $$\bar a=b\cdot\bar x.$$
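The chain-rule result can be checked numerically. The sketch below is only an illustration: the outer function `f(x) = x**2`, the values of `a` and `b`, and the step `h` are arbitrary assumptions, not something taken from the guide. It compares the backprop rule `b * x_bar` and the proposed alternative `x_bar / b` against a central finite difference of `f(a*b)` with respect to `a`.

```python
# Hypothetical outer function f (an assumption for illustration);
# any differentiable f would serve the same purpose.
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

a, b = 3.0, 4.0           # arbitrary example inputs
x = a * b                 # forward pass through the * gate
x_bar = f_prime(x)        # sensitivity of f w.r.t. x (the guide's "dx")

a_bar = b * x_bar         # backprop rule for the * gate
a_bar_divided = x_bar / b # the rule proposed in the question

# Central finite difference of f(a*b) with respect to a.
h = 1e-6
a_bar_numeric = (f((a + h) * b) - f((a - h) * b)) / (2 * h)

print("b * x_bar      :", a_bar)
print("x_bar / b      :", a_bar_divided)
print("finite diff    :", a_bar_numeric)
```

Under these assumptions the finite difference should agree with `b * x_bar` and not with `x_bar / b`, which is what the chain rule predicts.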

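It is better not to give the gradients the same names as the differentials. The guide's da and dx are the sensitivities $\bar a$ and $\bar x$, not differentials, which is why they cannot be rearranged like the quotient in dx/da = b. One convention is to name the sensitivities ba,bx,... and keep da,dx,... for the differentials. Then one has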

bx*dx=ba*da

if the input b is constant, or

bx*dx=ba*da+bb*db

if b is also variable. Inserting the product rule for the differentials on the left gives

bx*(da*b+a*db) = ba*da+bb*db

and by comparing coefficients of da and db one finds again ba=bx*b and bb=bx*a.
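The coefficient comparison can also be sanity-checked with numbers. This is a minimal sketch under stated assumptions: the outer function `f(x) = x**3`, the input values, and the small perturbations `da` and `db` are all made up for illustration. Here `ba`, `bb`, `bx` are the sensitivities and `da`, `db`, `dx` are differentials, following the naming convention above.

```python
# Hypothetical outer function f (an assumption for illustration).
def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

a, b = 2.0, 5.0            # arbitrary example inputs
x = a * b                  # forward pass

bx = f_prime(x)            # sensitivity of f w.r.t. x
ba = bx * b                # sensitivity of f w.r.t. a
bb = bx * a                # sensitivity of f w.r.t. b

da, db = 1e-6, -2e-6       # small perturbations (differentials) of the inputs
dx = da * b + a * db       # product rule, first order

lhs = bx * dx              # change in f expressed through x
rhs = ba * da + bb * db    # change in f expressed through a and b

# Actual change in f for comparison; it matches only to first order.
actual = f((a + da) * (b + db)) - f(a * b)

print("bx*dx          :", lhs)
print("ba*da + bb*db  :", rhs)
print("actual change  :", actual)
```

The two sides of the identity should agree up to rounding, and both should approximate the actual change in `f` with an error that shrinks quadratically as the perturbations get smaller.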