The update rule in Newton's method is $ x \leftarrow x - \mathbf{H}^{-1} \nabla $, where $ \mathbf{H} $ is the Hessian and $ \nabla $ is the gradient.
In the case of gradient descent, the update rule is $ x \leftarrow x - \alpha \nabla $, where $ \alpha $ is the step size or learning rate.
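As a concrete illustration of the two update rules, here is a minimal 1-D sketch. The function $f(x) = x^2$ and its hand-coded derivatives are just an assumption for illustration; they are not part of the original post.

```python
# Minimal 1-D sketch of the two update rules, assuming f(x) = x**2.

def grad(x):
    return 2 * x    # f'(x) for f(x) = x**2

def hess(x):
    return 2.0      # f''(x) for f(x) = x**2

def newton_step(x):
    # x <- x - H^{-1} * gradient (H is just a scalar in 1-D)
    return x - grad(x) / hess(x)

def gradient_descent_step(x, alpha):
    # x <- x - alpha * gradient
    return x - alpha * grad(x)

x0 = 4.0
print(newton_step(x0))                  # 0.0: reaches the minimum in one step
print(gradient_descent_step(x0, 0.1))   # 3.2: a small step toward the minimum
```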
I created the visual below to understand the impact of the learning rate in gradient descent. The orange circle indicates the initial value of $x$, and the 'Insights' table shows the first 10 steps of gradient descent for the chosen step size.
For this particular problem, Newton's method suggests a step size (the reciprocal of the second derivative) of 0.5. Some observations from this interactive visualization (a short numerical sketch reproducing them follows the list):
- A step size of 0.5 with gradient descent will reach the minimum in one step. (Just like Newton's method would for a convex quadratic function).
- Any learning rate lower than 0.5 leads to slower convergence.
- Any learning rate higher than 0.5 (but less than 1) leads to oscillations around the minimum, until the iterates eventually converge to it.
- With a learning rate of $\alpha = 1$, the iterates never converge for this function; they just keep oscillating between two points.
- A learning rate greater than 1 leads to the iterations moving away from the minimum.
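These observations are easy to reproduce numerically. A short sketch, again assuming $f(x) = x^2$ (so $f''(x) = 2$ and the Newton step size is $0.5$), which may differ from the function in the actual visual:

```python
# Run gradient descent on f(x) = x**2 for several learning rates.
def gd(alpha, x0=1.0, steps=10):
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - alpha * 2 * x   # x <- x - alpha * f'(x)
        trajectory.append(x)
    return trajectory

for alpha in [0.25, 0.5, 0.75, 1.0, 1.25]:
    print(alpha, [round(v, 3) for v in gd(alpha)])

# alpha=0.25: 1.0, 0.5, 0.25, ...    (monotone, slower than Newton)
# alpha=0.5 : 1.0, 0.0, 0.0, ...     (minimum in one step)
# alpha=0.75: 1.0, -0.5, 0.25, ...   (oscillates but converges)
# alpha=1.0 : 1.0, -1.0, 1.0, ...    (oscillates forever)
# alpha=1.25: 1.0, -1.5, 2.25, ...   (diverges)
```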
Now the question: 1 is twice the Newton step size for this optimization problem. Is it generally the case that a learning rate that is twice (or some other multiple of) the reciprocal of the second-order derivative is undesirable for optimization with gradient descent? Is anyone familiar with any research along these lines?
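For the purely quadratic 1-D case, a quick derivation shows where the factor of two comes from (a sketch, assuming $f(x) = \tfrac{c}{2}(x - x^*)^2$ with constant curvature $c = f'' > 0$, so the Newton step size is $1/c$):

$$ x_{t+1} - x^* \;=\; x_t - \alpha f'(x_t) - x^* \;=\; (1 - \alpha c)\,(x_t - x^*). $$

The error is multiplied by $(1 - \alpha c)$ at every step, so it shrinks only if $|1 - \alpha c| < 1$, i.e. $0 < \alpha < 2/c$. At exactly $\alpha = 2/c$ (twice the Newton step) the factor is $-1$ and the iterates oscillate forever; beyond it they diverge. Whether and how this carries over to general, non-quadratic functions is what the question above is getting at.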