Why does gradient descent not always land at the global minimum closest to the starting point?


I am given the function $\boxed{f(x,y)=((x^2+y^2)-1)^2}$ and need to do a gradient-descent analysis of it.

I have read that it is not trivial to show mathematically that "the ball reaches the global minimum closest to the starting point".

I want to explore this on my function.

I have computed, $$\nabla f=\left( \frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)$$ $$ \frac{\partial f}{\partial x} = 4x\left(x^2+y^2-1\right) $$ $$ \frac{\partial f}{\partial y} = 4y\left(x^2+y^2-1\right) $$
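As a quick sanity check of these partial derivatives (the helper names below are my own), the analytic gradient can be compared against central finite differences:

```python
# Compare the analytic gradient of f(x, y) = ((x^2 + y^2) - 1)^2
# with a central finite-difference approximation.

def f(x, y):
    return ((x**2 + y**2) - 1)**2

def grad_f(x, y):
    r2 = x**2 + y**2
    return (4*x*(r2 - 1), 4*y*(r2 - 1))

def numeric_grad(x, y, h=1e-6):
    gx = (f(x + h, y) - f(x - h, y)) / (2*h)
    gy = (f(x, y + h) - f(x, y - h)) / (2*h)
    return (gx, gy)

x, y = 0.7, -0.3
ga, gn = grad_f(x, y), numeric_grad(x, y)
assert abs(ga[0] - gn[0]) < 1e-5 and abs(ga[1] - gn[1]) < 1e-5
```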

What should I do next? Can anyone help me understand why gradient descent does not always land at the global minimum closest to the starting point, in the context of this function?

Accepted answer:

In general, gradient descent is not even guaranteed to reach the closest local minimum. The function may have saddle points or other stationary points where the iterates get stuck, or the iterates may run off to infinity (e.g. when a local minimum exists but no global minimum does). If you want to understand these failure cases, I'd suggest picking one such function. It is, in general, not well understood why gradient descent often performs well in practice even on non-convex functions.
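For this particular $f$, one failure case can be seen concretely: the gradient vanishes at the origin (which is in fact a local maximum), so gradient descent started exactly there never moves. A minimal sketch (step size and iteration count are arbitrary choices of mine):

```python
# Gradient descent on f(x, y) = ((x^2 + y^2) - 1)^2.
# grad f(0, 0) = (0, 0), so the origin is a stationary point (a local
# maximum) and gradient descent started exactly there is stuck forever.

def grad_f(x, y):
    r2 = x**2 + y**2
    return (4*x*(r2 - 1), 4*y*(r2 - 1))

def gd(x, y, step=0.05, iters=200):
    for _ in range(iters):
        gx, gy = grad_f(x, y)
        x, y = x - step*gx, y - step*gy
    return x, y

print(gd(0.0, 0.0))   # stuck at (0.0, 0.0)
print(gd(0.5, 0.0))   # moves outward toward the unit circle
```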

If you want to understand the theory behind the guarantees that do exist, look into convex, strongly convex, and smooth (Lipschitz-continuous gradient) functions.
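For reference, these are the standard definitions and the classical convergence guarantee (not specific to this $f$): a differentiable function $f$ is $\mu$-strongly convex if $$f(y) \ge f(x) + \langle \nabla f(x),\, y-x\rangle + \frac{\mu}{2}\|y-x\|^2,$$ and $L$-smooth if $$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x-y\|.$$ For such $f$, gradient descent with step size $\gamma = \frac 1 L$ satisfies $$\|x_t - x^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^t \|x_0 - x^*\|^2,$$ i.e. linear convergence to the unique minimizer $x^*$.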

For non-convex functions such as yours there is trajectory analysis, which, depending on the initial point and step size, can show that a function has certain "nice" properties along the trajectory of gradient descent. Using these properties one can sometimes show convergence to local or global minima, and even convergence-rate guarantees.

In the case of your function, however, some basic trajectory analysis shows that, with a suitable step size, gradient descent reaches the closest global minimum whenever the initial point $x_0 \neq \mathbf{0}$. The key is that $f$ is radially symmetric: along any ray from the origin with unit direction $u$, we have $f(\lambda u) = (\lambda^2-1)^2$. Write $x_0 = \lambda_0 u$ with $\lambda_0 > 0$; since $\nabla f(\lambda u) = 4\lambda(\lambda^2-1)\,u$ is parallel to $u$, any (small enough) step along the negative gradient yields an iterate that again lies in $\{\lambda u \mid \lambda > 0\}$. This is the trajectory-analysis part, and it reduces the problem to the one-dimensional case $g(\lambda)=(\lambda^2-1)^2$, $\lambda > 0$. Since $g''(\lambda) = 12\lambda^2 - 4$, $g$ is convex on $[\tfrac{1}{\sqrt 3},\infty)$; for iterates starting to the left of this interval you also need to show that, for any strictly positive step size, you eventually enter this range, so that you can reduce to the easier case. You should then be able to show that for some $\mu > 0$, $L > 0$, the function is $\mu$-strongly convex and $L$-smooth on a closed interval such as $[\tfrac 3 4, \max(\lambda_0, 2)]$ containing the minimizer $\lambda = 1$. It is then well known that gradient descent with step size $\gamma = \frac 1 L$ converges to the global minimum.
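The trajectory argument above can be checked numerically (step size and iteration count are my own arbitrary choices): both coordinates are scaled by the same factor at every step, so the iterates stay on the ray through $x_0$ and converge to the nearest point of the unit circle, $x_0/\|x_0\|$.

```python
# Starting from x0 != 0, every iterate stays on the ray
# {lambda * x0 : lambda > 0} (both coordinates are multiplied by the
# same factor 1 - 4*step*(r^2 - 1) each step), and the iterates
# converge to the nearest global minimum x0/||x0|| on the unit circle.

def gd(x, y, step=0.02, iters=500):
    for _ in range(iters):
        r2 = x*x + y*y
        x, y = x - step*4*x*(r2 - 1), y - step*4*y*(r2 - 1)
    return x, y

x, y = gd(0.3, 0.4)   # ||x0|| = 0.5, direction (0.6, 0.8)
print(x, y)           # close to (0.6, 0.8)
```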

Note that we cheated a bit here, since you generally cannot choose $L$ like this without knowing the function analytically; this weakens the statement somewhat. There is probably also a much simpler argument that does not rely on all this existing theory, but the theory is probably worth knowing anyway, given that the OP asked for a deep understanding.