I understand how gradient descent works. But I have trouble understanding why we usually use the derivative in the equation.
The equation is: new_value = old_value - learning_rate * derivative
I understand the derivative tells which direction to go. But why do we use learning_rate * derivative to find the new_value? What if I instead use learning_rate * a constant, e.g., 1?
I mean I can do this:
If the derivative is negative: new_value = old_value + learning_rate * 1
If the derivative is positive: new_value = old_value - learning_rate * 1
I know this is awkward; I just mean that many update rules could plausibly work. Why should we use learning_rate * derivative in particular?
Your insight is highly appreciated!
You need the derivative to know "in which direction to go". For scalar fields (functions $\Bbb R^n \to \Bbb R$), i.e., in multiple dimensions, this derivative becomes the gradient, and knowing "in which direction to go" to reach the nearest extremum matters far more: there are infinitely many directions to choose from, rather than just the two (left or right) of the 1D case. This is precisely what the gradient provides (or its opposite, the gradient scaled by $-1$, which points in the direction of steepest descent).
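To make the multi-dimensional case concrete, here is a minimal sketch (my own assumed example, not from the original post) of one gradient-descent step on $f(x, y) = x^2 + 10y^2$, whose gradient is $(2x, 20y)$. Among the infinitely many directions available in 2D, the negative gradient picks the steepest downhill one:

```python
# Assumed illustrative example: gradient descent on f(x, y) = x**2 + 10 * y**2.

def grad(p):
    """Gradient of f at point p = (x, y): (df/dx, df/dy) = (2x, 20y)."""
    x, y = p
    return (2 * x, 20 * y)

def step(p, lr):
    """One descent step: move against the gradient, scaled by the learning rate."""
    gx, gy = grad(p)
    return (p[0] - lr * gx, p[1] - lr * gy)

p = (3.0, 1.0)
for _ in range(40):
    p = step(p, 0.05)
print(p)  # both coordinates approach the minimizer (0, 0)
```

Note that the two coordinates shrink at different rates: the gradient component along $y$ is larger, so the descent moves faster in that direction, which a fixed constant step could not do.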
Additionally, the derivative/gradient encodes a measure of "how fast should I go to reach my extremum without overshooting it": its magnitude shrinks as you approach an extremum, so derivative-scaled steps naturally slow down near the solution. It acts as a dynamic, per-step weight on top of the fixed learning rate.