Gradient descent method with random perturbation


Suppose there is a function $f:\mathbb R^n \to \mathbb R$. One way to find a stationary point is to solve the ODE $\dot x = - \nabla f(x)$ and look at $\lim_{t\to\infty} x(t)$.

However, I want to consider a variation of this method where we solve $$ dx = - \nabla f(x)\, dt + C(t,x) \cdot dW_t ,$$ where $C(t,x) \in \mathbb R^{n\times n}$ and $W_t$ is an $n$-dimensional Wiener process, with some condition like $C(t,x) \to 0$ as $t\to\infty$. The hope is that it might converge to a stationary point faster, and also that the stationary point it converges to will be a local minimum.
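To make the question concrete, here is a minimal Euler–Maruyama discretization of that SDE, assuming the simple isotropic choice $C(t,x) = \sigma(t) I$ (the function names, the schedule $\sigma(t) = 1/(1+t)$, and all step-size constants are my own illustrative choices):

```python
import numpy as np

def perturbed_gradient_descent(grad_f, x0, sigma, dt=1e-2, n_steps=10_000, seed=0):
    """Euler-Maruyama scheme for dx = -grad f(x) dt + sigma(t) dW_t,
    i.e. the isotropic special case C(t, x) = sigma(t) * I."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        # Brownian increment over a step of length dt has std sqrt(dt).
        noise = rng.standard_normal(x.shape) * np.sqrt(dt)
        x = x - grad_f(x) * dt + sigma(t) * noise
    return x

# Toy example: f(x) = ||x||^2 / 2, with noise decaying to zero over time.
grad_f = lambda x: x
sigma = lambda t: 1.0 / (1.0 + t)
x_final = perturbed_gradient_descent(grad_f, np.array([2.0, -1.5]), sigma)
```

With a decaying $\sigma(t)$ the late iterations behave like plain gradient flow, so on this convex example the iterate ends up close to the minimizer at the origin.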

Can anyone give me some resources where I could read about this sort of thing? Via a Google search and following references, I did find the book Random Perturbations of Dynamical Systems by Mark I. Freidlin and Alexander D. Wentzell, but "gradient descent" is not in its index.


On BEST ANSWER

The Freidlin–Wentzell community is interested in a somewhat different set of topics from yours (metastability and exit problems). Your kind of question has been studied; see for example http://repository.ias.ac.in/1132/1/323.pdf and its references for further information.

In a nutshell, the SDE you describe, under general conditions on $c$ and in the case where $c$ does not depend on time, has an invariant measure with density proportional to $e^{-2 f(x)/c^2}$, so as $c$ tends to zero the diffusion concentrates on the global minima of the function $f$.

Now, for the case you are interested in, where $c$ is time-dependent, the same behavior is expected to hold if $c(t,x) \to 0$ as $t \to \infty$. This was studied for schedules of the form $c(t)^2 = c_0 / \log t$, and the result holds for $c_0$ large enough; the resulting process is called the annealing process. The most recent result I have seen in this direction is http://projecteuclid.org/euclid.aoap/1043862427. I also recommend the review paper http://onlinelibrary.wiley.com/store/10.1002/wcms.31/asset/31_ftp.pdf for a related setting that might interest you.
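As a rough illustration of such an annealing diffusion, here is a sketch on a tilted double well. The constants ($c_0$, the step size, the tilt $0.3$) are illustrative choices of mine, not values from the cited papers, and whether the process actually escapes the shallow well depends on $c_0$ relative to the barrier depth:

```python
import numpy as np

def anneal(grad_f, x0, c0=0.5, dt=1e-2, n_steps=20_000, seed=0):
    """Annealing diffusion dx = -f'(x) dt + c(t) dW_t in one dimension,
    with the logarithmic schedule c(t)^2 = c0 / log(e + t)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(1, n_steps + 1):
        t = k * dt
        sigma = np.sqrt(c0 / np.log(np.e + t))  # decays to 0, but only logarithmically
        x += -grad_f(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

# f(x) = (x^2 - 1)^2 + 0.3 x: global minimum near x = -1, a shallower
# local minimum near x = +1. Start in the shallow well.
grad_f = lambda x: 4.0 * x * (x**2 - 1.0) + 0.3
x_ann = anneal(grad_f, x0=1.0)
```

The logarithmic decay is painfully slow in practice — that slowness is exactly the price the theory demands for convergence to a global minimum.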


Not directly your question, but Wales, Energy Landscapes (2003, 681 pp.) might be interesting. It leads to "basin-hopping", a practical stochastic algorithm for minimizing, e.g., molecular configurations. The scipy basinhopping documentation says:

The acceptance test used here is the Metropolis criterion of standard Monte Carlo algorithms, although there are many other possibilities

(By the way, for accelerated gradient descent in general see Nesterov momentum, Adagrad, Adadelta ... more methods than test cases.)