I've found a variety of ways of writing Nesterov momentum, but I cannot understand why they cannot simply be expanded into a one-liner.
Here is one formulation I found that seems like it can just be rearranged; can someone explain where I go wrong?
$\theta_t = y_t - \gamma \nabla f(y_t) \\ y_{t+1} = \theta_t + \rho (\theta_t - \theta_{t-1})$
Plugging the first equation into the second gives
$y_{t+1} = y_t - \gamma \nabla f(y_t) + \rho (\theta_t - \theta_{t-1})$
Let $\Delta y_t = y_{t+1} - y_t$; then, after also expanding $\theta_t - \theta_{t-1}$ using the first equation, this simply becomes
$$\Delta y_t = - \gamma \nabla f(y_t) + \rho \left( y_t - \gamma \nabla f(y_t) - y_{t-1} + \gamma \nabla f(y_{t-1}) \right) \\ = - \gamma \nabla f(y_t) + \rho \left( \Delta y_{t-1} + \gamma \left( \nabla f(y_{t-1}) - \nabla f(y_t) \right) \right)$$
so what am I doing wrong?
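As a sanity check on the algebra above, one can iterate the two-step form and verify numerically that the collapsed one-line recursion holds along the trajectory. A minimal sketch, assuming a toy 1-D quadratic $f(y) = y^2/2$ (so $\nabla f(y) = y$) and an arbitrary initialization $\theta_{-1} = y_0$:

```python
# Generate iterates y_0, y_1, ... with the two-step Nesterov form, then check
# that the collapsed recursion
#   Delta y_t = -gamma*g(y_t) + rho*(Delta y_{t-1} + gamma*(g(y_{t-1}) - g(y_t)))
# reproduces each step (it is an algebraic identity, so it holds for t >= 1
# regardless of how theta_{-1} is initialized).
gamma, rho = 0.1, 0.9

def grad(y):
    return y  # gradient of the toy objective f(y) = y^2 / 2

ys = [1.0]          # y_0 (arbitrary start)
theta_prev = ys[0]  # theta_{-1}, assumed equal to y_0 for this sketch
for _ in range(30):
    theta = ys[-1] - gamma * grad(ys[-1])        # theta_t = y_t - gamma * grad f(y_t)
    ys.append(theta + rho * (theta - theta_prev))  # y_{t+1} = theta_t + rho*(theta_t - theta_{t-1})
    theta_prev = theta

max_err = 0.0
for t in range(1, len(ys) - 1):
    lhs = ys[t + 1] - ys[t]  # Delta y_t from the two-step iterates
    rhs = (-gamma * grad(ys[t])
           + rho * ((ys[t] - ys[t - 1])
                    + gamma * (grad(ys[t - 1]) - grad(ys[t]))))
    max_err = max(max_err, abs(lhs - rhs))

print(max_err)  # should be ~0, up to floating-point noise
```

The two sides agree to machine precision, which suggests the rearrangement itself is algebraically fine; the question is whether the resulting one-liner still counts as the "same" method.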
In fact, I've found a similar form in a paper: Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999,
where gradient descent with momentum is defined as
$$\Delta \theta_t = - \gamma \nabla f(\theta) + \rho \Delta \theta_{t-1} $$ (I'm also not sure why it's $f(\theta)$ and not $f(\theta_t)$)
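For comparison, Qian's classical (heavy-ball) momentum update is straightforward to iterate directly. A minimal sketch, reading $f(\theta)$ as $f(\theta_t)$ and reusing the same toy quadratic $f(\theta) = \theta^2/2$:

```python
# Classical momentum as in Qian (1999):
#   Delta theta_t = -gamma * grad f(theta_t) + rho * Delta theta_{t-1}
# Note it lacks the extra gradient-correction term
# rho * gamma * (grad f(y_{t-1}) - grad f(y_t)) that appears in the
# collapsed Nesterov recursion above.
gamma, rho = 0.1, 0.9

def grad(theta):
    return theta  # gradient of the toy objective f(theta) = theta^2 / 2

theta, delta = 1.0, 0.0  # start at theta_0 = 1 with zero initial velocity
for _ in range(200):
    delta = -gamma * grad(theta) + rho * delta
    theta += delta

print(abs(theta))  # converges toward the minimizer theta* = 0
```

Side by side with the collapsed Nesterov recursion, the only difference is that extra $\rho \gamma (\nabla f(y_{t-1}) - \nabla f(y_t))$ term, which is exactly what distinguishes the two methods once both are written as one-line updates.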