Why use a small learning rate in gradient descent?


I am new to neural networks and recently found out about gradient descent.

Something does not sit right with me.

$$x \leftarrow x - \lambda \nabla f_k(x)$$

Why does this formula work? Wouldn't it make more sense to make $\lambda$ a large value, thereby minimizing the cost function?

I may not be phrasing my question properly, as I am honestly quite confused. How could gradient descent result in a global optimum if it always reduces the value?
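
To make sure I understand the update rule, here is a toy sketch I put together (my own example, assuming $f(x) = x^2$ so that $\nabla f(x) = 2x$; the starting point and learning rate are arbitrary):

```python
# A minimal sketch of one gradient descent update on the assumed
# toy function f(x) = x**2, whose gradient is 2*x.

def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x

x = 5.0     # starting point (arbitrary choice)
lam = 0.1   # learning rate lambda (arbitrary choice)

# The update rule from the question: x <- x - lambda * grad f(x)
x_new = x - lam * grad_f(x)

print(f"before: x = {x}, f(x) = {f(x)}")          # before: x = 5.0, f(x) = 25.0
print(f"after:  x = {x_new}, f(x) = {f(x_new)}")  # after:  x = 4.0, f(x) = 16.0
```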


BEST ANSWER

Let me explain clearly:

The learning rate is the length of the step the algorithm takes down the gradient of the error curve.

So, if you have a high learning rate, the algorithm might overshoot the optimal point.

With a lower learning rate, any overshoot is smaller in magnitude than it would be with a higher learning rate.

So, when you overshoot, you end up at a non-optimal point whose error is higher; and if the learning rate is large enough, each step can overshoot farther than the last, so the iterates never settle at the minimum at all.
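
To make this concrete, here is a minimal sketch (my own toy example, not tied to any particular neural network, assuming $f(x) = x^2$ with gradient $2x$ and minimum at $x = 0$) comparing a small and a large learning rate:

```python
# Gradient descent on the assumed toy function f(x) = x**2.
# The minimum is at x = 0 and the gradient is 2*x.

def grad_f(x):
    return 2 * x

def descend(lam, x0=5.0, steps=10):
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lam * grad_f(x)  # the update rule from the question
        path.append(x)
    return path

# Small learning rate: the steps shrink steadily toward the minimum at 0.
print(descend(lam=0.1))  # 5.0, 4.0, 3.2, 2.56, ... -> approaches 0

# Large learning rate: each step jumps past 0 and lands farther away,
# so the iterates diverge instead of converging.
print(descend(lam=1.1))  # 5.0, -6.0, 7.2, -8.64, ... -> blows up
```

With $\lambda = 0.1$ each update multiplies $x$ by $0.8$, so the error keeps decreasing, while with $\lambda = 1.1$ each update multiplies $x$ by $-1.2$, so every step overshoots the minimum and the error grows.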