Does the gradient-descent algorithm always converge to the closest local optimum?


Assume $f(\vec x)$ is Lipschitz continuous and has two local minima $\vec x_1^*$ and $\vec x_2^*$, where $\vec x_1^*$ is the global minimum. We start the gradient-descent algorithm from $\vec x_0$ with $\left \|\vec x_0 - \vec x_1^* \right \| \leq \left \|\vec x_0 - \vec x_2^* \right \|$. Additionally, we know that $f(\vec x)$ is convex between $\vec x_0$ and $\vec x_1^*$, but not convex between $\vec x_1^*$ and $\vec x_2^*$.

Can we say that gradient descent initialized at $\vec x_0$ always converges to $\vec x_1^*$ rather than $\vec x_2^*$? Is there a theorem that supports this? If not, is there some other way to guarantee it?

2 Answers

Best Answer

No, there is no guarantee that you will converge to $x_1$. This also depends on the choice of your step size. Recall that the gradient-descent algorithm is defined by $$ \vec x^{k+1} = \vec x^k -\gamma_k \nabla f(\vec x^k),$$ where $\gamma_k$ is the step size.

First let us show that for a constant step size one can always find a counterexample. So suppose $\gamma_k = \gamma$ for every $k$, and consider the function $$f(x)= \left\{ \begin{array}{ll} \frac{2}{\gamma}x^2 & \text{if }x \leq 1\\  \frac{2}{\gamma} &\text{else } \end{array}\right. \quad \text{ then }\quad f'(x)= \left\{ \begin{array}{ll} \frac{4}{\gamma}x & \text{if }x \leq 1 \\ 0 &\text{else } \end{array}\right.$$ together with the points $x_0=-\frac{2}{3}$, $x_1=0$, $x_2 = 2$ (so that $|x_0-x_1| < |x_0-x_2|$). The first step of gradient descent is $$ x_0-\gamma f'(x_0) = -\frac{2}{3} + \gamma\,\frac{4}{\gamma}\cdot\frac{2}{3} = 2 = x_2, $$ and the iteration then gets stuck at $x_2$: since $f'(x_2)=0$, the next step is $x_2 -\gamma f'(x_2) = x_2$, and so on. In the picture below you can find the plot of $f$ for $\gamma = 2$.
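This counterexample is easy to check numerically. The sketch below (a minimal illustration, using $\gamma = 2$ as in the plot) implements $f$, its derivative, and the constant-step iteration:

```python
# Counterexample: constant-step gradient descent overshoots the nearby
# minimum x_1 = 0 and lands in the flat region containing x_2 = 2.

gamma = 2.0  # constant step size; f is built from gamma so the overshoot is exact

def f(x):
    return (2 / gamma) * x**2 if x <= 1 else 2 / gamma

def df(x):
    return (4 / gamma) * x if x <= 1 else 0.0

x = -2.0 / 3.0  # x_0, closer to x_1 = 0 than to x_2 = 2
for _ in range(10):
    x = x - gamma * df(x)

print(x)  # stuck at x_2 = 2 (up to floating point), where f'(x) = 0
```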

[Plot of $f$ with $\gamma = 2$.]

Now one could argue that a constant step size is a bad choice and that setting $$\gamma_k = \arg\min_{\gamma \geq 0} f(\vec x^k-\gamma \nabla f(\vec x^k)),\qquad (*)$$ would be better (i.e. we choose the step that optimizes the descent). However, even this does not always work! It is shown in the lecture notes on Convex Optimization by M. Hein that the function $$ f(x,y) = \left\{ \begin{array}{ll} 5\sqrt{9x^2+16y^2} & \text{if }x >|y| \\ 9x+16|y| &\text{else } \end{array}\right.$$ is pathological enough that if you run gradient descent with exact line search (i.e. $\gamma_k$ fulfills $(*)$ at each step) starting from $(1,\left(\frac{10}{16}\right)^2)$, then the method converges to $(0,0)$, which is not even a local minimum. In contrast, a constant step size $\gamma$ small enough will (slowly) tend to $(-\infty, 0)$.
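This behaviour can also be checked numerically. The sketch below is my own illustration, not from Hein's notes; it uses the fact that inside the cone $x > |y|$ we have $f(x,y) = 5\left\|(3x,4y)\right\|$, so the line search $(*)$ amounts to minimizing a quadratic in $\gamma$ and has a closed form. That formula is only valid as long as the iterates stay inside the cone, which the code verifies at every step:

```python
import math

# On the cone x > |y|: f(x, y) = 5*sqrt(9x^2 + 16y^2) = 5*||(3x, 4y)||,
# with grad f = (45x, 80y) / sqrt(9x^2 + 16y^2).

def grad(p):
    x, y = p
    r = math.sqrt(9 * x**2 + 16 * y**2)
    return (45 * x / r, 80 * y / r)

def exact_step(p):
    """Exact line search inside the cone: minimizing ||A(p + g*d)|| over g
    (A = diag(3, 4), d = -grad f) equals minimizing the quadratic
    ||A(p + g*d)||^2, which is solved in closed form."""
    d = tuple(-g for g in grad(p))
    Ap = (3 * p[0], 4 * p[1])
    Ad = (3 * d[0], 4 * d[1])
    g = -(Ap[0] * Ad[0] + Ap[1] * Ad[1]) / (Ad[0]**2 + Ad[1]**2)
    return (p[0] + g * d[0], p[1] + g * d[1])

p = (1.0, (10 / 16) ** 2)  # starting point from the text
for _ in range(40):
    p = exact_step(p)
    assert p[0] > abs(p[1])  # still in the cone, so the closed form applies

print(p)                   # very close to (0, 0) ...
gx, gy = grad(p)
print(math.hypot(gx, gy))  # ... yet ||grad f|| >= 15 everywhere in the cone,
                           # so the limit (0, 0) is not a stationary point
```

Note how the iterates shrink geometrically toward the origin while the gradient norm stays bounded away from zero.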

Note: A very complete account of the gradient-descent algorithm (and the limits of its applicability) can be found in the book Convex Optimization by S. Boyd and L. Vandenberghe.

Answer

The distance between pairs of points has nothing to do with the local directional derivative of the function along trajectories from $x_0$ to the two points $x_1, x_2$. You could have $\left\|x_0 - x_1\right\| < \left\|x_0 - x_2\right\|$ and yet the gradient at $x_0$ points toward $x_2$ instead of $x_1$. Convexity, or the lack of it, in the region between $x_0$ and $x_1, x_2$ has nothing to do with it either, because gradient descent is a local method that relies only on gradients, not on second derivatives. You could very well converge to $x_2$ with gradient descent, though perhaps not with more sophisticated methods.
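To make this concrete, here is a toy one-dimensional example of my own (not from the answer): $f(x) = \frac{x^4}{4} - \frac{11}{3}x^3 + 5x^2$ has $f'(x) = x(x-1)(x-10)$, local minima at $x_1 = 0$ and $x_2 = 10$, and a local maximum at $1$. The point $x_0 = 3$ is closer to $x_1$, but $f'(3) = -42 < 0$, so gradient descent moves right and converges to $x_2$:

```python
# Toy example: x0 = 3 is closer to the minimum x1 = 0 than to x2 = 10,
# yet f'(3) = -42 < 0, so gradient descent immediately heads toward x2.

def df(x):
    # derivative of f(x) = x^4/4 - (11/3) x^3 + 5 x^2
    return x * (x - 1) * (x - 10)

x = 3.0                       # x0: distance 3 to x1, distance 7 to x2
for _ in range(5000):
    x = x - 1e-3 * df(x)      # small constant step size

print(x)  # converges to the farther minimum x2 = 10
```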