This and other papers, as well as this blog, suggest that we can rewrite the NAG algorithm, which basically does $$\theta_{t+1} = \theta_t - m_t$$ with $$m_t = \gamma m_{t-1} + \eta \frac{\partial L(\theta_t - \gamma m_{t-1})}{\partial \theta_t}$$
as $$\theta_{t+1} = \theta_t - \left(\gamma m_t + \eta \frac{\partial L(\theta_t)}{\partial \theta_t}\right)$$ with $$m_t = \gamma m_{t-1} + \eta \frac{\partial L(\theta_t)}{\partial \theta_t}$$ Plugging in, these are clearly not the same iterative scheme. Rather, the "rewritten NAG" looks as if it's just classic SGD with momentum, but with different weights on $m_{t-1}$ and $\frac{\partial L(\theta_t)}{\partial\theta_t}$. Is there any proof that these two are the same, or at least do the same thing? Is there any intuition for why they would be?
Thank you.
Answer: YES, these two formulas are equivalent (proof below).
Answer: NO, Nesterov is not equivalent to classic momentum with different weights.
Proof: Nesterov's formula (as originally written by the author) involves two sequences of parameters $(x_t)_{t=0..T}$ and $(y_t)_{t=0..T}$, which stay close to each other (meaning $\|x_t-y_t\|$ goes to zero). (Note: I denote the parameters $x_t$ and $y_t$ instead of $\theta_t$.) Nesterov's update is the following: $$ \begin{cases} (1) \quad x_{t} & = y_t - \eta \cdot L'(y_t) \\ (2) \quad m_t & = x_{t} - x_{t-1} \\ (3) \quad y_{t+1} & = x_{t} + \gamma \cdot m_t \\ \end{cases}$$ You can rewrite (3): $$\begin{eqnarray*} y_{t+1} &=& (y_t - \eta \cdot L'(y_t)) + \gamma m_t \\ &=& y_t + \gamma m_t - \eta \cdot L'(y_t) \end{eqnarray*}$$ And you can find an update formula for $m_t$, because $$\begin{eqnarray*} m_{t+1} &=& x_{t+1} - x_{t} \\ &=& y_{t+1} - \eta \cdot L'(y_{t+1}) - x_t \\ &=& (x_t + \gamma \cdot m_t) - \eta \cdot L'(y_{t+1}) - x_t \\ &=& \gamma \cdot m_t - \eta \cdot L'(y_{t+1}) \\ \end{eqnarray*}$$
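To make the two-sequence scheme concrete, here is a minimal sketch (my own illustration, not from the original derivation) on the toy loss $L(\theta)=\frac12\theta^2$, for which $L'(\theta)=\theta$; the values of `eta` and `gamma` are arbitrary illustrative choices:

```python
# Toy illustration (not from the original post): Nesterov's two-sequence
# update on L(theta) = 0.5 * theta**2, so L'(theta) = theta.

def grad(theta):
    return theta  # L'(theta) for L(theta) = 0.5 * theta**2

def nesterov_two_sequences(theta0, eta=0.1, gamma=0.9, steps=50):
    x_prev = y = theta0           # conventions: x_{-1} = y_0 = theta0
    xs, ys = [], []
    for _ in range(steps):
        x = y - eta * grad(y)     # (1) gradient step from y_t
        m = x - x_prev            # (2) momentum = displacement of x
        y = x + gamma * m         # (3) look-ahead point y_{t+1}
        x_prev = x
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = nesterov_two_sequences(1.0)
# Both sequences converge to the minimizer 0, and ||x_t - y_t|| -> 0.
print(abs(xs[-1]), abs(xs[-1] - ys[-1]))
```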
So, discarding the sequence $(x_t)_t$, Nesterov is equivalent to: $$ \begin{cases} m_t & = \gamma \cdot m_{t-1} - \eta \cdot L'(y_t) \\ y_{t+1} & = y_{t} + \gamma \cdot m_t - \eta \cdot L'(y_t) \\ \end{cases}$$
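As a numerical sanity check (my own sketch, again on the toy loss $L(\theta)=\frac12\theta^2$), the $y$-only recursion reproduces the $y_t$ sequence of the two-sequence scheme, provided we initialize $m_{-1}=0$ and $x_{-1}=y_0$:

```python
# Sanity check (toy loss L = 0.5 * theta**2, my own illustration): the
# y-only recursion matches the y_t sequence of the two-sequence scheme.

def grad(theta):
    return theta  # L'(theta) for the toy loss

eta, gamma, steps = 0.1, 0.9, 30

# Two-sequence form.
x_prev = y = 1.0
ys_two = []
for _ in range(steps):
    x = y - eta * grad(y)
    m = x - x_prev
    y = x + gamma * m
    x_prev = x
    ys_two.append(y)

# y-only form: m_t = gamma * m_{t-1} - eta * L'(y_t),
#              y_{t+1} = y_t + gamma * m_t - eta * L'(y_t).
y1, m1 = 1.0, 0.0
ys_one = []
for _ in range(steps):
    g = grad(y1)
    m1 = gamma * m1 - eta * g
    y1 = y1 + gamma * m1 - eta * g
    ys_one.append(y1)

diff = max(abs(a - b) for a, b in zip(ys_two, ys_one))
print(diff)  # numerically zero
```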
On the other hand, instead of discarding $(x_t)_t$, you can discard $(y_t)_t$, and I leave it to you to show that Nesterov is equivalent to: $$ \begin{cases} m_{t+1} & = \gamma \cdot m_t - \eta \cdot L'(x_t + \gamma \cdot m_t ) \\ x_{t} & = x_{t-1} + m_t\\ \end{cases}$$
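The same kind of check works for the $x$-only form (my own sketch, same toy-loss assumption). Note that the gradient is evaluated at $x_t + \gamma m_t = y_{t+1}$, i.e. exactly at the look-ahead point of the question's original NAG formula:

```python
# Sanity check for the x-only form (toy loss L = 0.5 * theta**2, my own
# illustration).  L' is evaluated at x_t + gamma * m_t, which is the
# look-ahead point y_{t+1}.

def grad(theta):
    return theta  # L'(theta) for the toy loss

eta, gamma, steps = 0.1, 0.9, 30

# Two-sequence form (reference).
x_prev = y = 1.0
xs_two = []
for _ in range(steps):
    x = y - eta * grad(y)
    m = x - x_prev
    y = x + gamma * m
    x_prev = x
    xs_two.append(x)

# x-only form: m_t = gamma * m_{t-1} - eta * L'(x_{t-1} + gamma * m_{t-1}),
#              x_t = x_{t-1} + m_t, starting from x_{-1} = y_0, m_{-1} = 0.
x1, m1 = 1.0, 0.0
xs_one = []
for _ in range(steps):
    m1 = gamma * m1 - eta * grad(x1 + gamma * m1)
    x1 = x1 + m1
    xs_one.append(x1)

diff = max(abs(a - b) for a, b in zip(xs_two, xs_one))
print(diff)  # numerically zero
```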
So in fact, strictly speaking, the two updates are not the same, because one is an update on $x_t$ and the other an update on $y_t$; but the two sequences are coupled in Nesterov's scheme.
Concerning the difference from Momentum, I have recently answered that question here: https://stats.stackexchange.com/questions/179915.