Gradient flow of Nesterov accelerated gradient methods


I am reading a nice paper [1] that derives a differential equation modeling NAG. The update rules of NAG are:

$$x_k = y_{k-1} - \eta \nabla f(y_{k-1}) \tag{1}$$ $$y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}) \tag{2}$$

As I understand it, the basic intuition is that momentum is used to accelerate convergence. Moreover, the gradient is computed at $y_{k-1}$ instead of $x_{k-1}$, where $y_{k-1}$ is a look-ahead estimate of the next iterate.
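To make the recursion concrete, here is a minimal sketch of updates (1)-(2) on an illustrative quadratic $f(x) = \frac{1}{2}x^\top A x$; the matrix $A$, step size $\eta$, and iteration count are my own choices for demonstration, not taken from the paper.

```python
import numpy as np

def nag(grad, x0, eta, num_iters):
    """Nesterov accelerated gradient, following updates (1)-(2)."""
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, num_iters + 1):
        x = y - eta * grad(y)                       # Eq. (1): gradient step at y_{k-1}
        y = x + (k - 1) / (k + 2) * (x - x_prev)    # Eq. (2): momentum extrapolation
        x_prev = x
    return x

# Illustrative problem: f(x) = 0.5 * x^T A x, minimizer x* = 0.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x_star = nag(grad, np.array([1.0, 1.0]), eta=0.05, num_iters=200)
# x_star is close to the minimizer 0.
```

Note that $\eta$ must satisfy $\eta \le 1/L$ (here $L = 10$) for the standard $O(1/k^2)$ guarantee to apply.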

In the paper, the authors prove that, as $\eta \to 0$, the NAG iterates converge to the solution of the ODE below,

$$\ddot X(t) + \frac{3}{t}\dot X(t) + \nabla f(X) = 0 \tag{3}$$

In their informal derivation in Section 2, $\nabla f(y_k)$ is approximated by $\nabla f(X(t)) + o(1)$, and the $o(1)$ term vanishes in the limit, as expected.
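For what it is worth, here is my compressed reading of that argument; the Taylor-expansion steps below are my own paraphrase. Eliminating $y$ from (1)-(2) gives

$$\frac{x_{k+1} - x_k}{\sqrt{\eta}} = \frac{k-1}{k+2}\,\frac{x_k - x_{k-1}}{\sqrt{\eta}} - \sqrt{\eta}\,\nabla f(y_k).$$

With the ansatz $X(k\sqrt{\eta}) \approx x_k$, i.e. $t = k\sqrt{\eta}$, Taylor expansion gives $\frac{x_{k+1}-x_k}{\sqrt{\eta}} \approx \dot X(t) + \tfrac{1}{2}\sqrt{\eta}\,\ddot X(t)$ and $\frac{x_k-x_{k-1}}{\sqrt{\eta}} \approx \dot X(t) - \tfrac{1}{2}\sqrt{\eta}\,\ddot X(t)$, while $\frac{k-1}{k+2} = 1 - \frac{3}{k+2} \approx 1 - \frac{3\sqrt{\eta}}{t}$. Substituting, using $\nabla f(y_k) = \nabla f(X(t)) + o(1)$, and matching the $O(\sqrt{\eta})$ terms yields exactly (3).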

However, as far as I can see, this means that replacing $\nabla f(y_{k-1})$ with $\nabla f(x_{k-1})$ in update rule (1) yields exactly the same ODE. Does it follow that this replacement also achieves the same $O(1/T^2)$ convergence rate for smooth convex functions?
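As a sanity check (certainly not a proof), one can run the modified recursion, with the gradient taken at $x_{k-1}$, on a toy quadratic and observe that it still converges. The problem instance, step size, and horizon below are my own choices.

```python
import numpy as np

def nag_variant(grad, x0, eta, num_iters):
    """Like updates (1)-(2), but with the gradient at x_{k-1} instead of y_{k-1}."""
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, num_iters + 1):
        x = y - eta * grad(x_prev)                  # gradient at x_{k-1}, not y_{k-1}
        y = x + (k - 1) / (k + 2) * (x - x_prev)    # Eq. (2) unchanged
        x_prev = x
    return x

# Illustrative problem: f(x) = 0.5 * x^T A x, minimizer x* = 0.
A = np.diag([1.0, 10.0])
x_end = nag_variant(lambda v: A @ v, np.array([1.0, 1.0]), eta=0.01, num_iters=500)
# x_end approaches the minimizer 0, consistent with (though not implying)
# the small-eta behavior predicted by the shared ODE limit.
```

Of course, a single converging trajectory says nothing about the worst-case rate, which is exactly what the question asks about.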

[1] Su, Weijie; Boyd, Stephen; Candès, Emmanuel J., A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, J. Mach. Learn. Res. 17, Paper No. 153, 43 p. (2016). ZBL1391.90667.