Why use a small learning rate in gradient descent?


I am new to neural networks and recently found out about gradient descent.

Something does not sit right with me.

$$x \leftarrow x - \lambda \nabla f_k(x)$$

Why does this formula work? Wouldn't it make more sense to make $\lambda$ a large value, thereby minimizing the cost function?

I may not be phrasing my question properly, as I am honestly quite confused. How could gradient descent result in a global optimum if it always reduces the value?
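
To make sure I understand the update rule, here is a toy sketch I put together (my own example, assuming $f(x) = x^2$ so that $\nabla f(x) = 2x$; the starting point and learning rate are arbitrary):

```python
# A minimal sketch of one gradient descent update on the assumed
# toy function f(x) = x**2, whose gradient is 2*x.

def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x

x = 5.0     # starting point (arbitrary choice)
lam = 0.1   # learning rate lambda (arbitrary choice)

# The update rule from the question: x <- x - lambda * grad f(x)
x_new = x - lam * grad_f(x)

print(f"before: x = {x}, f(x) = {f(x)}")          # before: x = 5.0, f(x) = 25.0
print(f"after:  x = {x_new}, f(x) = {f(x_new)}")  # after:  x = 4.0, f(x) = 16.0
```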


BEST ANSWER

Let me explain clearly:

The learning rate is the length of the step the algorithm takes down the gradient of the error curve.

So, if you have a high learning rate, the algorithm might overshoot the optimal point.

With a lower learning rate, any overshoot is smaller in magnitude than it would be with a higher learning rate.

So, when you overshoot, you end up at a non-optimal point whose error is higher; and if the learning rate is large enough, each step can overshoot farther than the last, so the iterates never settle at the minimum at all.
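
To make this concrete, here is a minimal sketch (my own toy example, not tied to any particular neural network, assuming $f(x) = x^2$ with gradient $2x$ and minimum at $x = 0$) comparing a small and a large learning rate:

```python
# Gradient descent on the assumed toy function f(x) = x**2.
# The minimum is at x = 0 and the gradient is 2*x.

def grad_f(x):
    return 2 * x

def descend(lam, x0=5.0, steps=10):
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lam * grad_f(x)  # the update rule from the question
        path.append(x)
    return path

# Small learning rate: the steps shrink steadily toward the minimum at 0.
print(descend(lam=0.1))  # 5.0, 4.0, 3.2, 2.56, ... -> approaches 0

# Large learning rate: each step jumps past 0 and lands farther away,
# so the iterates diverge instead of converging.
print(descend(lam=1.1))  # 5.0, -6.0, 7.2, -8.64, ... -> blows up
```

With $\lambda = 0.1$ each update multiplies $x$ by $0.8$, so the error keeps decreasing, while with $\lambda = 1.1$ each update multiplies $x$ by $-1.2$, so every step overshoots the minimum and the error grows.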