Escape from local minima in the gradient descent method


I'm using gradient descent $$x_i=x_{i-1}-\gamma\nabla f(x_{i-1})\tag1$$ to minimize a function $f$. Now, I've observed that I get stuck in a local minimum when $x_0$ is chosen in an unlucky way.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

I've come up with a simple idea, which actually works, but I guess it is quite inefficient. What I'm doing is replacing $(1)$ by $$x_i=x_{i-1}-\gamma_i\frac{\nabla f(x_{i-1})}{\left\|\nabla f(x_{i-1})\right\|}\tag2,$$ where $\gamma_i=\gamma_{i-1}/2$ and $\gamma_0=1$. Now, after every iteration $i$, I choose randomly a $\tilde\gamma$ in $[0,1)$, compute $$\tilde x_i=x_i-\tilde\gamma\frac{\nabla f(x_i)}{\left\|\nabla f(x_i)\right\|}\tag3$$ and if $f(\tilde x_i)<f(x_i)$, then I set $\gamma_i=\tilde\gamma$ and $x_i=\tilde x_i$.
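For concreteness, here is a minimal sketch of the scheme in $(2)$–$(3)$ in Python; `f`, `grad`, and the iteration count are placeholders for your own objective and settings:

```python
import numpy as np

def descent_with_random_steps(f, grad, x0, iters=100, rng=None):
    """Sketch of (2)-(3): normalized gradient steps with a halving
    step size, plus a random trial step accepted only if it helps."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    gamma = 1.0                                # gamma_0 = 1
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) == 0:
            break
        gamma /= 2                             # gamma_i = gamma_{i-1} / 2
        x = x - gamma * g / np.linalg.norm(g)  # step (2)
        g = grad(x)
        if np.linalg.norm(g) == 0:
            break
        gamma_t = rng.random()                 # random step size in [0, 1)
        x_t = x - gamma_t * g / np.linalg.norm(g)  # trial step (3)
        if f(x_t) < f(x):                      # keep the trial only if f improves
            x, gamma = x_t, gamma_t
    return x
```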

I'm sure there is a smarter way.


There are 2 answers below.

Best answer:

There are plenty of solutions to this very problem under development, driven by the use of gradient-descent methods to optimize neural networks.

Depending on the dimension of your state space $x$, there are smarter or more straightforward ways to do it.

If the dimension of $x$ is low, you can do a brute-force search on a grid surrounding the point where you get stuck. How large and dense the grid has to be depends on the regularity of the function $f$.
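A minimal sketch of such a grid search, assuming NumPy and a callable objective `f`; the radius and grid density are illustrative and should be tuned to the regularity of $f$:

```python
import numpy as np
from itertools import product

def grid_search(f, center, radius=1.0, points_per_dim=11):
    """Evaluate f on a regular grid around `center` and return the best
    grid point. Cost grows as points_per_dim ** dim, so this is only
    feasible in low dimension."""
    axes = [np.linspace(c - radius, c + radius, points_per_dim)
            for c in center]
    best = min(product(*axes), key=lambda p: f(np.array(p)))
    return np.array(best)
```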

If the dimension is greater than 1, you probably also want to escape saddle points; to detect them, analyse the eigenvalues of the Hessian matrix of $f$ near the point where you are stuck.
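A sketch of that eigenvalue check, assuming a callable `hess` that returns the (symmetric) Hessian of $f$ at a point:

```python
import numpy as np

def classify_critical_point(hess, x, tol=1e-8):
    """Classify a critical point x by the eigenvalues of the Hessian:
    all positive -> local minimum, all negative -> local maximum,
    mixed signs -> saddle point."""
    eigvals = np.linalg.eigvalsh(hess(x))  # eigvalsh: symmetric matrices
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "degenerate"
```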

That being said, if the dimensionality of your problem is low, there are more theoretically sound optimization techniques of order higher than one, which have theoretical guarantees on convergence and work better than plain gradient descent.

If the dimensionality of your problem is high, you cannot afford the computation of the Hessian due to its quadratic complexity. In that case you want to restrict yourself to first-order optimization techniques. The most popular choices here are SGD, Adam, and RAdam. If you are concerned about saddle points, you can add noise at each iteration, and you can do so smartly, as presented in the work of M. Jordan and coauthors.
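A minimal sketch of the noise-injection idea, a much-simplified cousin of the perturbed gradient descent analysed by Jordan and coauthors; the step size, noise scale, and iteration count here are illustrative, not values from any paper:

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, gamma=0.1, sigma=0.01,
                               iters=1000, rng=None):
    """First-order descent with isotropic Gaussian noise added at every
    step; the noise helps the iterate escape saddle points, where the
    gradient alone vanishes."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        noise = sigma * rng.standard_normal(x.shape)
        x = x - gamma * grad(x) + noise
    return x
```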

As a rule of thumb: if you are looking for a good-enough solution and the dimensionality of $x$ is absurdly high, go read some Deep Learning literature. If you want theoretical guarantees for your optimization algorithm and the dimensionality of $x$ is still pretty high, read the physics literature. If you have no margin for error and the dimension of $x$ is low, read the mathematics literature.

The topic is vast and I have barely scratched the surface here. I hope you can now research the solution best suited to your problem yourself, because, stated as is, the question has no single good answer.

Second answer:

Although I'm not very qualified in this area, I believe I have enough experience with numerical optimization to make the following statements, which you might find helpful.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

Firstly, there are conditions, such as oscillating iterates $x_i$, that indicate you are stuck.

More importantly, when you are stuck, before "jumping" to the next valley you should make sure that the valley you are currently in contains only a local minimum rather than the global minimum.

Currently, there is no known algorithm that guarantees the current valley contains the global minimum.

That being said, if you know the domain of $x_i$, then, as Rodrigo de Azevedo mentioned above, you can sample a grid within the domain to seed the initial guesses and see which initial guess converges to the lowest value.
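A sketch of this multi-start strategy, assuming the domain is given as per-coordinate `(lo, hi)` bounds; the number of starts, step size, and iteration budget are illustrative:

```python
import numpy as np

def multistart_descent(f, grad, bounds, n_starts=20, gamma=0.05,
                       iters=500, rng=None):
    """Run plain gradient descent from several starting points sampled
    in the given box and keep the result with the lowest f value."""
    rng = np.random.default_rng() if rng is None else rng
    best_x, best_f = None, np.inf
    for _ in range(n_starts):
        x = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        for _ in range(iters):
            x = x - gamma * grad(x)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x
```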

But I disagree with Rodrigo de Azevedo about the grid being uniform. In fact, you can make the samples non-uniform; this is akin to an adaptive particle-swarm method.

A more efficient/smarter way to proceed when stuck:

As Jan has mentioned the computation of the Hessian, I will give an alternative.

Instead of randomly choosing a different step size $\gamma$, it is often better to choose a different step direction, i.e. not the steepest-descent direction.

If you keep a record of all the previous iterates, you can go back to the most recent uphill point, set the direction to steepest ascent (the opposite direction) instead, and use that as the seed of your next iteration.

This is often better than choosing a different step size, especially when the valley you are currently stuck in does not contain the global minimum.
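A rough sketch of the history-based restart described above; the stopping tolerance, how far back to rewind, and the number of ascent steps are arbitrary illustrative choices, not part of the answer:

```python
import numpy as np

def descent_with_ascent_restart(grad, x0, gamma=0.05, iters=200,
                                ascent_steps=5, tol=1e-6):
    """Plain gradient descent that records its iterates; when the
    gradient vanishes (stuck at a critical point), rewind ~10 iterates
    into the history and take a few steepest-ascent steps to climb back
    out of the valley before descending again."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]                   # record of previous iterates
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stuck: gradient ~ 0
            x = history[max(0, len(history) - 10)].copy()
            for _ in range(ascent_steps):  # steepest ascent (opposite
                x = x + gamma * grad(x)    # direction) to climb out
            history = [x.copy()]
            continue
        x = x - gamma * g                  # ordinary descent step
        history.append(x.copy())
    return x
```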