Faster gradient descent convergence by transforming the gradient?


If we modify the gradient descent update for a convex objective function $f(\boldsymbol{\theta})$ from $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \nabla f(\boldsymbol{\theta}_t)$ to $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - g(\nabla f(\boldsymbol{\theta}_t))$, i.e., we transform the gradient by a function $g$, how should we choose $g$ so that convergence is faster than with the original gradient descent?

More specifically, by running $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - g(\nabla f(\boldsymbol{\theta}_t))$ we are in effect optimizing a surrogate function $\tilde{f}$ whose gradient is $\nabla \tilde{f}(\boldsymbol{\theta}) = g(\nabla f(\boldsymbol{\theta}))$. We need $g(\boldsymbol{0})=\boldsymbol{0}$ so that $f$ and $\tilde{f}$ share the same minimizer, and we need $g$ to be non-decreasing so that $\tilde{f}$ remains convex. But beyond that, how should we choose $g$ so that optimizing $\tilde{f}$ reaches the optimum faster than optimizing $f$? Should we choose $g$ so that $\tilde{f}$ has a smaller Lipschitz smoothness constant $L$ and a larger strong-convexity constant $\mu$, since these two quantities determine the convergence rate of gradient descent?
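As background for that last point, here is a minimal numerical sketch (with a made-up diagonal quadratic, chosen so $L$ and $\mu$ are just the extreme eigenvalues) checking the standard result that gradient descent with step $1/L$ on a $\mu$-strongly-convex, $L$-smooth function contracts the error by a factor of $1 - \mu/L$ per step:

```python
import numpy as np

# Toy objective f(theta) = 0.5 * theta^T A theta (values made up for
# illustration); L and mu are the largest/smallest eigenvalues of A.
A = np.diag([10.0, 1.0])
L, mu = 10.0, 1.0

theta = np.array([1.0, 1.0])
errors = [np.linalg.norm(theta)]
for _ in range(50):
    theta = theta - (1.0 / L) * (A @ theta)   # plain GD with step 1/L
    errors.append(np.linalg.norm(theta))

# Empirical per-step contraction matches the theoretical 1 - mu/L = 0.9.
ratio = errors[-1] / errors[-2]
print(ratio)
```

So any $g$ that shrinks the effective ratio $L/\mu$ of the surrogate should speed up this linear rate, which is exactly what preconditioning aims at.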

Thanks!

Best answer:

I believe that preconditioning already gives such a function $g$: in that case $g$ is simply multiplication by a matrix. I am not aware of any established acceleration method that uses a non-linear $g$.

I assume you mean that $g$ does not change as the iterations progress, i.e. the same function $g$ is applied at every iterate. Otherwise, Newton's method is an example of such a method that substantially accelerates convergence.
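To illustrate the iterate-dependent case, here is a sketch of Newton's method, whose "transform" $g_t(\boldsymbol{v}) = H(\boldsymbol{\theta}_t)^{-1}\boldsymbol{v}$ depends on the current iterate, compared against plain gradient descent on a made-up one-dimensional convex objective $f(x) = e^x + e^{-2x}$:

```python
import numpy as np

# Convex toy objective f(x) = exp(x) + exp(-2x), minimized at x = ln(2)/3.
def grad(x):
    return np.exp(x) - 2.0 * np.exp(-2.0 * x)

def hess(x):
    return np.exp(x) + 4.0 * np.exp(-2.0 * x)

x_newton = x_gd = 1.0
for _ in range(6):
    x_newton -= grad(x_newton) / hess(x_newton)  # g changes per iterate
for _ in range(6):
    x_gd -= 0.1 * grad(x_gd)                     # fixed-step gradient descent

# After the same number of steps, Newton's gradient is essentially zero
# while gradient descent is still far from stationarity.
print(abs(grad(x_newton)), abs(grad(x_gd)))
```

With a fixed $g$, by contrast, one is limited to reshaping a single surrogate $\tilde{f}$ once and for all.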