Stopping criteria for gradient method


For numerically solving a smooth convex optimization problem $\min\{f(x): x\in S\}$, where $S$ is a closed convex set, we can apply various algorithms: gradient method, accelerated gradient, proximal gradient, ..., depending on the structure of the problem. Solving means finding a solution $x^*$ such that $f(x^*)=\inf\{f(x): x\in S\}:=f^*$. To this end, we try to construct an iterative sequence $\{x^k\}$ that converges to some solution $x^*$, or such that the sequence of numbers $\{f(x^k)\}$ tends to $f^*$. Note that if $x^k\to x^*$, then the continuity of $f$ ensures that $f(x^k) \to f^*$.
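As a concrete illustration of such an iterative sequence, here is a minimal projected gradient sketch; the objective $f(x)=(x-2)^2$, the set $S=[0,1]$, and the step size are illustrative choices, not part of the question:

```python
# Projected gradient method on a toy problem:
#   minimize f(x) = (x - 2)^2 over S = [0, 1].
# The minimizer is x* = 1 (on the boundary of S), with f* = f(1) = 1.
# Problem data and step size are illustrative assumptions.

def f(x):
    return (x - 2.0) ** 2

def grad_f(x):
    return 2.0 * (x - 2.0)

def project(x):
    """Euclidean projection onto S = [0, 1]."""
    return min(max(x, 0.0), 1.0)

def projected_gradient(x0, step=0.1, iters=100):
    x = x0
    for _ in range(iters):
        x = project(x - step * grad_f(x))
    return x

x = projected_gradient(0.0)
print(x, f(x))   # x^k -> x* = 1, and by continuity f(x^k) -> f* = 1
```

Here the sequence $\{x^k\}$ converges to $x^*=1$, and $\{f(x^k)\}$ converges to $f^*=1$ along with it.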

My questions are:

  1. In which cases should we focus on the convergence of $\{f(x^k)\}$ rather than that of $\{x^k\}$? Is finding a point $x^K$ close enough to a solution $x^*$ better than finding a point $x^K$ such that $f(x^K)$ is close enough to $f^*$?

  2. What is the best stopping criterion for an algorithm? I know the following ways:

    • Determine in advance the number of iterations $N$ needed to achieve a desired error $\epsilon$, i.e., $k\geq N$ implies $||x^k-x^*||<\epsilon$ or $|f(x^k)-f^*|<\epsilon$. I see that this way is very reliable.

    • Terminate when $||x^{k+1}-x^k||$ or $|f(x^{k+1})-f(x^k)|$ is small enough.

    • Terminate when $||\nabla f(x^k)||$ is small enough.

Could you explain how the second and third criteria work? Why does $||\nabla f(x^k)||$ being small enough imply that $f(x^k)$ approximates the optimal value $f^*$? I know that when $f$ is strongly convex this can be verified. Is this stopping criterion still reliable when $f$ is not strongly convex?
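For reference, the second and third criteria look like this in code (a sketch: the quadratic objective, step size, and tolerance are illustrative assumptions):

```python
# Gradient descent on the convex quadratic f(x, y) = x^2 + 10 y^2
# (f* = 0 at the origin), stopping on whichever criterion fires first:
#   criterion 3: ||grad f(x^k)|| small, or
#   criterion 2: ||x^{k+1} - x^k|| small.
# Objective, step size, and tolerance are illustrative choices.
import math

def f(p):
    x, y = p
    return x * x + 10.0 * y * y

def grad(p):
    x, y = p
    return (2.0 * x, 20.0 * y)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def gradient_descent(p0, step=0.05, tol=1e-6, max_iter=10_000):
    p = p0
    for k in range(max_iter):
        g = grad(p)
        if norm(g) < tol:                        # criterion 3: small gradient
            return p, k, "gradient"
        p_next = tuple(c - step * gc for c, gc in zip(p, g))
        if norm(tuple(a - b for a, b in zip(p_next, p))) < tol:
            return p_next, k + 1, "step"         # criterion 2: small step
        p = p_next
    return p, max_iter, "max_iter"               # criterion 1: fixed budget

p, k, reason = gradient_descent((1.0, 1.0))
print(reason, k, f(p))
```

Note that for a fixed step size $t$, the step length is $t\,||\nabla f(x^k)||$, so here criterion 2 is just criterion 3 with the tolerance scaled by $t$; with $t<1$ it fires first.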

1 Answer


If $f$ is strictly convex, it has at most one minimizer, and at an interior minimizer its gradient is zero, so the third criterion should work fine. More precisely, for convex $f$ one has $f(x^k)-f^* \leq \langle \nabla f(x^k), x^k-x^*\rangle \leq ||\nabla f(x^k)||\,||x^k-x^*||$, so a small gradient norm forces a small suboptimality gap as long as the iterates stay bounded; only convexity is needed for this bound, not strong convexity. If $f$ is not convex, you may reach a stationary point that is only a local minimum, so this criterion is not really justified. If you are using a gradient method, the second criterion is essentially equivalent to the third, because each step (the difference $x^{k+1}-x^k = -t_k \nabla f(x^k)$) is a multiple of the gradient, so $||x^{k+1}-x^k|| = t_k\,||\nabla f(x^k)||$.
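The last point can be checked numerically: for a plain (unprojected) gradient step $x^{k+1}=x^k-t\nabla f(x^k)$, the step length is exactly $t$ times the gradient norm. A small sketch (the objective, step size, and test point are illustrative):

```python
# For a plain gradient step x+ = x - t * grad f(x), we always have
# ||x+ - x|| = t * ||grad f(x)||, so criteria 2 and 3 differ only
# by the factor t. Objective and step size are illustrative.
import math

def grad(p):                       # gradient of f(x, y) = x^2 + 10 y^2
    x, y = p
    return (2.0 * x, 20.0 * y)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

t = 0.05
p = (0.3, -0.7)                    # arbitrary current iterate
g = grad(p)
p_next = tuple(c - t * gc for c, gc in zip(p, g))
step_len = norm(tuple(a - b for a, b in zip(p_next, p)))
print(step_len, t * norm(g))       # the two numbers coincide
```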