Why use learning_rate * derivative in gradient descent instead of learning_rate * constant?


I understand how gradient descent works. But I have trouble understanding why we usually use the derivative in the equation.

The equation is: new_value = old_value - learning_rate * derivative

I understand that the derivative tells us which direction to go. But why do we use learning_rate * derivative to find the new_value? What if I use learning_rate * a constant, e.g., 1?

I mean I can do this:

If the derivative is negative: new_value = old_value + learning_rate * 1

If the derivative is positive: new_value = old_value - learning_rate * 1

I know this is awkward; I just mean that many update rules seem possible. Why should we use learning_rate * derivative?
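The two update rules being compared can be sketched side by side. This is a minimal illustration, assuming the test function f(x) = x**2 (so the derivative is 2x) and a starting point of 5; both are arbitrary choices:

```python
# Compare the standard gradient update with the question's constant-step
# variant on f(x) = x**2, whose derivative is f'(x) = 2*x.

def grad_step(x, lr):
    # Standard gradient descent: the step shrinks as the slope shrinks.
    return x - lr * (2 * x)

def sign_step(x, lr):
    # The constant-step alternative: fixed step size, only the
    # direction comes from the sign of the derivative.
    deriv = 2 * x
    if deriv > 0:
        return x - lr * 1
    elif deriv < 0:
        return x + lr * 1
    return x

x_grad, x_sign = 5.0, 5.0
for _ in range(50):
    x_grad = grad_step(x_grad, lr=0.3)
    x_sign = sign_step(x_sign, lr=0.3)

print(x_grad)  # essentially at the minimum, x = 0
print(x_sign)  # stuck bouncing around 0 at the fixed step size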

Your insight is highly appreciated!


There are 3 answers below.

Answer 1:

You need the derivative to know "in which direction to go". For scalar fields (functions $\Bbb R^n \to \Bbb R$), i.e., in multiple dimensions, this derivative becomes the gradient, and knowing "in which direction to go" to reach the nearest extremum matters far more: there are infinitely many directions to choose from, rather than just left or right as in 1D. This is precisely what the gradient (or rather its opposite, the gradient scaled by $-1$) gives you.

Additionally, the magnitude of the derivative/gradient encodes a measure of "how fast should I go to reach my extremum without overshooting it" — an extra, more dynamic weight on top of the fixed learning rate that smooths out the descent.
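The direction argument can be made concrete. In this sketch, the surface f(x, y) = x**2 + 10*y**2 and the chosen constant vector are assumptions made purely for illustration:

```python
# In 1D the derivative only picks between "left" and "right"; in R^n
# there are infinitely many directions, and the negative gradient is
# the one that (for a small step) is guaranteed to go downhill.

def f(x, y):
    return x**2 + 10 * y**2

def grad(x, y):
    # Gradient of f: the direction of steepest ascent.
    return (2 * x, 20 * y)

x, y = 3.0, 3.0
gx, gy = grad(x, y)
step = 0.01

f0 = f(x, y)
# A small step against the gradient decreases f.
f_down = f(x - step * gx, y - step * gy)
# A step along an arbitrary constant vector such as (1, -1) can go uphill.
f_const = f(x - step * 1, y - step * (-1))

print(f0, f_down, f_const)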

Answer 2:

That would be horribly ineffective. The derivative not only tells us whether the weight should be increased or decreased, it also tells us by how much. Say the optimum of a weight (in a network) is 1.32 and it is randomly initialised at 5. With a constant step of 1, your algorithm would get to 2, and then instead of hitting 1.32 it would go down to 1. On the next update it would increase the weight by 1, back to 2. The weight just keeps alternating between 1 and 2, never reaching the optimum.

Now if you had used the derivative instead, then as the weight approached 1.32, the step $lr \cdot \frac{\partial L}{\partial w}$ (where $L$ is the loss) would keep getting smaller. In other words, as you approach the optimum your steps shrink, getting closer and closer to 1.32, and your network would perform far better.
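This answer's scenario can be reproduced numerically. The quadratic loss (w - 1.32)**2 and the learning rate for the gradient version are assumptions made to have a concrete derivative; the optimum 1.32, the start at 5, and the constant step of 1 come from the answer:

```python
# Optimum at w = 1.32, weight initialised at 5.
# Assumed loss: (w - 1.32)**2, with derivative 2 * (w - 1.32).

OPT = 1.32

def derivative(w):
    return 2 * (w - OPT)

# Constant step of 1, as in the answer: the weight walks
# 5 -> 4 -> 3 -> 2 -> 1 and then alternates between 1 and 2 forever.
w_const = 5.0
for _ in range(20):
    w_const += -1.0 if derivative(w_const) > 0 else 1.0

# Gradient step with a modest (assumed) learning rate: the step
# shrinks along with the derivative, so w homes in on 1.32.
w_grad = 5.0
for _ in range(20):
    w_grad -= 0.25 * derivative(w_grad)

print(w_const)  # 1.0 or 2.0, still oscillating
print(w_grad)   # very close to 1.32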

Answer 3:

When you are far from the minimum you want to decrease the objective rapidly, and when you are close you want to step less aggressively to avoid overshooting. Neither is possible if you just add or subtract a constant.

For more information, look up momentum gradient descent, the Armijo rule, and other variants of gradient descent. All of them fine-tune the step size so that you do not simply add or subtract a constant every time.
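As an illustration of the Armijo rule mentioned above, here is a minimal backtracking line search. The test function f(x) = x**4 and the constants c and beta are assumed, conventional choices, not anything prescribed by the answer:

```python
# Backtracking line search with the Armijo condition: shrink the step t
# until the achieved decrease is at least c * t * (gradient squared).

def f(x):
    return x**4

def df(x):
    return 4 * x**3

def armijo_step(x, t0=1.0, c=1e-4, beta=0.5):
    g = df(x)
    t = t0
    # Sufficient-decrease test; halve t until it passes.
    while f(x - t * g) > f(x) - c * t * g * g:
        t *= beta
    return x - t * g

x = 2.0
for _ in range(30):
    x = armijo_step(x)
print(x)  # converges toward the minimum at 0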

In the multidimensional case you would be asking why not add a vector such as $(1, -1, 1, 1, -1)$. The answer is the same.