I've found a variety of ways of writing Nesterov momentum, but I cannot understand why they cannot simply be expanded into a one-liner.
Here is one formulation I found that seems like it can just be rearranged; can someone explain where I go wrong?
$\theta_t = y_t - \gamma \nabla f(y_t) \\ y_{t+1} = \theta_t + \rho (\theta_t - \theta_{t-1})$
Plugging the first equation into the second gives
$y_{t+1} = y_t - \gamma \nabla f(y_t) + \rho (\theta_t - \theta_{t-1})$
Let $\Delta y_t = y_{t+1} - y_t$; then, after also expanding $\theta_t - \theta_{t-1}$ using the first equation, this simply becomes
$$\Delta y_t = - \gamma \nabla f(y_t) + \rho \left( y_t - \gamma \nabla f(y_t) - y_{t-1} + \gamma \nabla f(y_{t-1}) \right) \\ = - \gamma \nabla f(y_t) + \rho \left( \Delta y_{t-1} + \gamma \left( \nabla f(y_{t-1}) - \nabla f(y_t) \right) \right)$$
so what am I doing wrong?
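As a sanity check on the algebra above, one can iterate the two-step form and verify numerically that the collapsed one-line recursion holds along the trajectory. A minimal sketch, assuming a toy 1-D quadratic $f(y) = y^2/2$ (so $\nabla f(y) = y$) and an arbitrary initialization $\theta_{-1} = y_0$:

```python
# Generate iterates y_0, y_1, ... with the two-step Nesterov form, then check
# that the collapsed recursion
#   Delta y_t = -gamma*g(y_t) + rho*(Delta y_{t-1} + gamma*(g(y_{t-1}) - g(y_t)))
# reproduces each step (it is an algebraic identity, so it holds for t >= 1
# regardless of how theta_{-1} is initialized).
gamma, rho = 0.1, 0.9

def grad(y):
    return y  # gradient of the toy objective f(y) = y^2 / 2

ys = [1.0]          # y_0 (arbitrary start)
theta_prev = ys[0]  # theta_{-1}, assumed equal to y_0 for this sketch
for _ in range(30):
    theta = ys[-1] - gamma * grad(ys[-1])        # theta_t = y_t - gamma * grad f(y_t)
    ys.append(theta + rho * (theta - theta_prev))  # y_{t+1} = theta_t + rho*(theta_t - theta_{t-1})
    theta_prev = theta

max_err = 0.0
for t in range(1, len(ys) - 1):
    lhs = ys[t + 1] - ys[t]  # Delta y_t from the two-step iterates
    rhs = (-gamma * grad(ys[t])
           + rho * ((ys[t] - ys[t - 1])
                    + gamma * (grad(ys[t - 1]) - grad(ys[t]))))
    max_err = max(max_err, abs(lhs - rhs))

print(max_err)  # should be ~0, up to floating-point noise
```

The two sides agree to machine precision, which suggests the rearrangement itself is algebraically fine; the question is whether the resulting one-liner still counts as the "same" method.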
In fact, I've found a similar form in a paper: Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999,
where gradient descent with momentum is defined as
$$\Delta \theta_t = - \gamma \nabla f(\theta) + \rho \Delta \theta_{t-1} $$ (I'm also not sure why it's $f(\theta)$ and not $f(\theta_t)$)
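For comparison, Qian's classical (heavy-ball) momentum update is straightforward to iterate directly. A minimal sketch, reading $f(\theta)$ as $f(\theta_t)$ and reusing the same toy quadratic $f(\theta) = \theta^2/2$:

```python
# Classical momentum as in Qian (1999):
#   Delta theta_t = -gamma * grad f(theta_t) + rho * Delta theta_{t-1}
# Note it lacks the extra gradient-correction term
# rho * gamma * (grad f(y_{t-1}) - grad f(y_t)) that appears in the
# collapsed Nesterov recursion above.
gamma, rho = 0.1, 0.9

def grad(theta):
    return theta  # gradient of the toy objective f(theta) = theta^2 / 2

theta, delta = 1.0, 0.0  # start at theta_0 = 1 with zero initial velocity
for _ in range(200):
    delta = -gamma * grad(theta) + rho * delta
    theta += delta

print(abs(theta))  # converges toward the minimizer theta* = 0
```

Side by side with the collapsed Nesterov recursion, the only difference is that extra $\rho \gamma (\nabla f(y_{t-1}) - \nabla f(y_t))$ term, which is exactly what distinguishes the two methods once both are written as one-line updates.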