Momentum, Acceleration, Over-relaxation


I commonly hear the terms "momentum", "acceleration", and "over-relaxation" when discussing optimization. At the moment, though, it seems to me that the choice of word used to describe a given method is arbitrary. I'm sure that's not the case, so I'm asking here.

For example, FISTA is often described as the proximal gradient method with acceleration. Why isn't it over-relaxation? Conversely, one can over-relax the Projection Onto Convex Sets (POCS) algorithm; why isn't that called acceleration?

Can someone provide me with rigorous definitions for these three terms? I would very much like it if the definition of momentum conformed to the definition employed in the distill article here: https://distill.pub/2017/momentum/. (Though, of course, that isn't necessary.)

Thank you all for your help.

Best answer:

These terms are used in informal ways by various authors, so it's difficult to give formal definitions. However, we can discuss how the terms are commonly used to describe various kinds of algorithms.

The term "over-relaxation" comes from the method of Successive Over-Relaxation (SOR) for iteratively solving a linear system of equations. In SOR, a step is taken in the same direction as in the earlier Gauss-Seidel method (which belongs to the class of relaxation methods). However, this step is multiplied by a scaling factor $\omega$, which may be larger than one. SOR with $\omega=1$ is simply the Gauss-Seidel method.
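To make the relationship concrete, here is a minimal sketch of SOR in Python; the function name and the small test system are made up for illustration. Each coordinate update computes the ordinary Gauss-Seidel value and then blends it with the old value using the relaxation factor $\omega$:

```python
import numpy as np

def sor(A, b, omega, iters=100):
    """Solve A x = b by Successive Over-Relaxation (SOR).
    omega = 1 recovers the Gauss-Seidel method."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # Gauss-Seidel value for coordinate i ...
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            gs = (b[i] - sigma) / A[i, i]
            # ... blended with the old value via the relaxation factor omega
            x[i] = (1 - omega) * x[i] + omega * gs
    return x

# A small symmetric positive-definite system (made up for illustration)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(sor(A, b, omega=1.0))   # plain Gauss-Seidel
print(sor(A, b, omega=1.25))  # over-relaxed
```

Note that $\omega$ only rescales the Gauss-Seidel step; for symmetric positive-definite systems the iteration converges for any $0 < \omega < 2$, and a well-chosen $\omega > 1$ can converge considerably faster than $\omega = 1$.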

Compared with conventional (sub)gradient descent methods, momentum methods use a search direction that is a linear combination of the current gradient and one or more previous search directions. This changes the search direction itself, so it is not the same thing as over-relaxation (which changes how far we go along the search direction).
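A minimal sketch of this idea is the "heavy ball" update, as in the distill article the question links to; the function name, step size, and test quadratic below are illustrative choices, not from any specific paper:

```python
import numpy as np

def gd_momentum(grad, x0, lr=0.1, beta=0.9, iters=500):
    """Gradient descent with momentum ("heavy ball"): each step is a
    linear combination of the current gradient and the previous step."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = beta * v - lr * grad(x)  # mix the old direction with the new gradient
        x = x + v
    return x

# Minimize f(x) = 0.5 x^T Q x on an ill-conditioned quadratic (illustrative)
Q = np.diag([1.0, 25.0])
grad = lambda x: Q @ x
print(gd_momentum(grad, [1.0, 1.0]))
```

With `beta = 0` this reduces to plain gradient descent; with `beta > 0` the accumulated velocity `v` damps the oscillation across the narrow valley of the ill-conditioned quadratic.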

The term "accelerated" is used to describe first-order optimization methods with provably better iteration complexity than conventional (sub)gradient descent. For example, if a conventional gradient descent method requires $O(1/\epsilon)$ iterations to achieve an $\epsilon$-approximate solution for a class of problems, an accelerated method might require only $O(1/\sqrt{\epsilon})$ iterations. Most accelerated first-order methods use the momentum idea.
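As an example of how acceleration builds on momentum, here is a sketch of Nesterov's accelerated gradient in a common textbook form (the same momentum schedule FISTA uses); the function name and test problem are illustrative. The gradient is evaluated at a "look-ahead" point extrapolated from the last two iterates, which is what yields the improved $O(1/k^2)$ rate on the function values:

```python
import numpy as np

def nesterov(grad, x0, lr, iters=2000):
    """Nesterov's accelerated gradient (a sketch): take the gradient
    step at a look-ahead point built from a momentum extrapolation."""
    x = np.array(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_new = y - lr * grad(y)                     # gradient step at look-ahead
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2    # FISTA-style momentum schedule
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # extrapolate past the new point
        x, t = x_new, t_new
    return x

Q = np.diag([1.0, 25.0])
grad = lambda x: Q @ x
print(nesterov(grad, [1.0, 1.0], lr=1 / 25))  # lr = 1/L, L = largest eigenvalue of Q
```

FISTA is exactly this scheme with the gradient step replaced by a proximal gradient step, which is why it is described as the proximal gradient method with acceleration.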