Why does gradient descent zigzag and takes long time on the Rosenbrock function

1k Views Asked by At

So today in class we discussed how sometimes even for smooth functions that have nice analytical properties gradient descent takes a long time.

$F(x,y) = 1(x - 1)^2 + 100(y-x^2)^2$

I understand that the the disparity between the $1(x - 1)^2$ and $100(y-x^2)^2$ creates some sort of a scaling problem that makes the method zigzag. My problem is I don't understand why do we get this zigzaging, and how do we see it analytically. I went over the proof of the rate of convergence of gradient descent https://www.cs.cmu.edu/~ggordon/10725-F12/scribes/10725_Lecture5.pdf in this document and I don't see how it is affected by the particular form of $f$. Thank you very much for your time.

1

There are 1 best solutions below

0
On BEST ANSWER

In the case of the Rosenbrock function, there is a valley that lies approximately along the curve $y=x^2$. If you start gradient descent from a point in the valley, the gradient points roughly along the curve $y=x^2$ and moves towards the minimum of the function, although with very small steps because the gradient is small here.

For the analysis, notice that the Rosenbrock function has compact level sets. Although the Rosenbrock function is not Lipschitz on all of $\mathbb{R}^2$, it is Lipschitz on the sets $\{(x,y):f(x,y)\leq c\}$, where $c\geq0$. Therefore, we can take step sizes small enough to ensure that gradient descent converges to a point in these level sets and the iteration stays in these sets.

Now suppose that you start on one of the sides of this valley. The previous argument says that the iterations will either go to the bottom of the valley or go to the side of the valley at a slightly lower point. Because of the scaling property you mentioned, if you are on the side of the valley, your gradient will likely be very large and will point almost directly towards the valley. But since the gradient is large, it overshoots, so we end up on the other side of the valley, and we repeat. This is happening because the direction of the gradient is dominated by the gradient of $100(y-x^2)^2$, so a fixed step size (and even a basic line search in this case) will almost always result in this zigzagging. This can be avoided if the step size is small enough that the step doesn't cross the valley, then the iterations will converge to the valley and never cross.