I am new to neural networks and recently found out about gradient descent.
Something does not sit right with me.
$$x \leftarrow x - \lambda \nabla f_k(x)$$
Why does this formula work? Wouldn't it make more sense for lambda to be a large value, thereby minimizing the cost function?
I am probably not phrasing my question properly, as I am honestly quite confused. How could gradient descent result in a global optimum if it always reduces the value?
Let me explain this clearly:
The learning rate is the length of the steps the algorithm takes down the gradient of the error curve.
If the learning rate is high, the algorithm might overshoot the optimal point.
With a lower learning rate, any overshoot that does occur is smaller in magnitude than it would be with a higher learning rate.
When the algorithm overshoots, it ends up at a non-optimal point whose error is higher.
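Here is a minimal sketch in Python (my own illustration, not from the original post) showing this on the toy function f(x) = x², whose gradient is 2x and whose minimum is at x = 0. With a small learning rate the iterates move steadily toward 0; with a learning rate that is too large, every update overshoots the minimum and the values grow instead of shrinking.

```python
# Toy illustration of the effect of the learning rate (lambda) on
# gradient descent applied to f(x) = x**2, which has its minimum at x = 0.
def gradient_descent(x0, lr, steps=8):
    x = x0
    trajectory = [x]
    for _ in range(steps):
        grad = 2 * x           # gradient of f(x) = x**2
        x = x - lr * grad      # update rule: x <- x - lambda * grad f(x)
        trajectory.append(x)
    return trajectory

# Small learning rate: iterates shrink toward the minimum at 0.
print([round(v, 3) for v in gradient_descent(5.0, lr=0.1)])
# Learning rate too large: each step overshoots 0 and the error blows up.
print([round(v, 3) for v in gradient_descent(5.0, lr=1.1)])
```

Running this, the first trajectory decays toward 0 (5, 4, 3.2, 2.56, ...), while the second oscillates across 0 with growing magnitude (5, -6, 7.2, -8.64, ...), which is exactly the overshoot described above.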