Understanding the rewritten form of Nesterov Accelerated Gradient used to derive Nadam


This, other papers, and this blog suggest that we can rewrite the NAG algorithm, which performs $$\theta_{t+1} = \theta_t - m_t$$ with $$m_t = \gamma m_{t-1} + \eta \frac{\partial L(\theta_t - \gamma m_{t-1})}{\partial\theta_t}$$

as $$\theta_{t+1} = \theta_t - \left(\gamma m_t + \eta \frac{\partial L(\theta_t)}{\partial \theta_t}\right)$$ with $$m_t = \gamma m_{t-1} + \eta \frac{\partial L(\theta_t)}{\partial\theta_t}$$ Plugging in, these are clearly not the same iterative scheme. Rather, the "rewritten NAG" looks like classic SGD with momentum, only with different weights on $m_{t-1}$ and $\frac{\partial L(\theta_t)}{\partial\theta_t}$. Is there a proof that these two are the same, or at least behave the same way? Is there any intuition for why they should be?

Thank you.

Is there any proof that these two are the same or at least do the same thing?

Answer: YES, these two formulas are equivalent (proof below).

Rather, the "rewritten NAG" seems as if it's just the classic SGD with momentum but with different weights.

Answer: NO, Nesterov is not equivalent to classic momentum with different weights.

Proof: Nesterov's formula (as originally written by the author) involves two sequences of parameters $(x_t)_{t=0..T}$, $(y_t)_{t=0..T}$, which stay close to each other (meaning $\|x_t-y_t\|$ goes to zero). (Note: I denote the parameters $x_t$ and $y_t$ instead of $\theta_t$.) Nesterov's update is the following: $$ \begin{cases} (1) \quad x_{t} & = y_t - \eta \cdot L'(y_t) \\ (2) \quad m_t & = x_{t} - x_{t-1} \\ (3) \quad y_{t+1} & = x_{t} + \gamma \cdot m_t \\ \end{cases}$$ You can rewrite (3): $$\begin{eqnarray*} y_{t+1} &=& (y_t - \eta \cdot L'(y_t)) + \gamma m_t \\ &=& y_t + \gamma m_t - \eta \cdot L'(y_t) \end{eqnarray*}$$ And you can find an update formula for $m_t$, because $$\begin{eqnarray*} m_{t+1} &=& x_{t+1} - x_{t} \\ &=& y_{t+1} - \eta \cdot L'(y_{t+1}) - x_t \\ &=& (x_t + \gamma \cdot m_t) - \eta \cdot L'(y_{t+1}) - x_t \\ &=& \gamma \cdot m_t - \eta \cdot L'(y_{t+1}) \\ \end{eqnarray*}$$

So, discarding the sequence $(x_t)_t$, Nesterov is equivalent to: $$ \begin{cases} m_t & = \gamma \cdot m_{t-1} - \eta \cdot L'(y_t) \\ y_{t+1} & = y_{t} + \gamma \cdot m_t - \eta \cdot L'(y_t) \\ \end{cases}$$
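As a small sanity check, here is a numerical sketch comparing the original two-sequence scheme with the rewritten $y$-update. It assumes a toy quadratic loss $L(y) = \tfrac12 y^2$ (so $L'(y) = y$), and arbitrary choices of $\eta$, $\gamma$, and the starting point; the initializations $x_{-1} = y_0$ and $m_{-1} = 0$ are matched so both recursions start from the same state:

```python
# Sanity check: original two-sequence Nesterov vs. the rewritten y-update.
# Toy loss L(y) = 0.5 * y**2, so L'(y) = y (assumed; any smooth loss would do).
grad = lambda y: y
eta, gamma, T = 0.1, 0.9, 50

# Original scheme: x_t = y_t - eta*L'(y_t); m_t = x_t - x_{t-1}; y_{t+1} = x_t + gamma*m_t
y, x_prev = 2.0, 2.0          # start with x_{-1} = y_0
ys_original = []
for _ in range(T):
    x = y - eta * grad(y)
    m = x - x_prev
    y = x + gamma * m
    x_prev = x
    ys_original.append(y)

# Rewritten scheme: m_t = gamma*m_{t-1} - eta*L'(y_t); y_{t+1} = y_t + gamma*m_t - eta*L'(y_t)
y, m = 2.0, 0.0               # m_{-1} = 0 matches x_{-1} = y_0 above
ys_rewritten = []
for _ in range(T):
    g = grad(y)
    m = gamma * m - eta * g   # update the momentum first...
    y = y + gamma * m - eta * g  # ...then take the step with the new m_t
    ys_rewritten.append(y)

# The two y-sequences coincide up to floating-point noise.
assert all(abs(a - b) < 1e-12 for a, b in zip(ys_original, ys_rewritten))
```

Note the ordering: the rewritten scheme must compute the new $m_t$ before using it in the $y_{t+1}$ update.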

On the other hand, instead of discarding $(x_t)_t$, you can discard $(y_t)_t$, and I leave it to you to show that Nesterov is equivalent to: $$ \begin{cases} m_{t+1} & = \gamma \cdot m_t - \eta \cdot L'(x_t + \gamma \cdot m_t ) \\ x_{t} & = x_{t-1} + m_t\\ \end{cases}$$
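The same kind of numerical check works for the $x$-update form: the sequence $(x_t)$ it produces matches the $(x_t)$ of the original two-sequence scheme. Again this is a sketch with the assumed toy loss $L'(x) = x$ and arbitrary constants:

```python
# Sanity check: original two-sequence Nesterov vs. the x-update ("look-ahead") form.
# Toy loss L(x) = 0.5 * x**2, so L'(x) = x (assumed for illustration).
grad = lambda x: x
eta, gamma, T = 0.1, 0.9, 50

# Original scheme, this time recording the x-sequence.
y, x_prev = 2.0, 2.0          # x_{-1} = y_0, i.e. m_{-1} = 0
xs_original = []
for _ in range(T):
    x = y - eta * grad(y)
    m = x - x_prev
    y = x + gamma * m
    x_prev = x
    xs_original.append(x)

# x-update form: m_t = gamma*m_{t-1} - eta*L'(x_{t-1} + gamma*m_{t-1}); x_t = x_{t-1} + m_t
x, m = 2.0, 0.0
xs_lookahead = []
for _ in range(T):
    g = grad(x + gamma * m)   # gradient at the look-ahead point, which is exactly y_t
    m = gamma * m - eta * g
    x = x + m
    xs_lookahead.append(x)

# The two x-sequences coincide up to floating-point noise.
assert all(abs(a - b) < 1e-12 for a, b in zip(xs_original, xs_lookahead))
```

This form makes the "gradient at the look-ahead point" interpretation of NAG explicit: the gradient is evaluated at $x_{t-1} + \gamma m_{t-1}$, which is precisely $y_t$.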

So, strictly speaking, the two updates are not the same, because one is an update on $x_t$ and the other is an update on $y_t$; but the two sequences are coupled in Nesterov's scheme.

Concerning the difference with classic momentum, I have recently answered that question here: https://stats.stackexchange.com/questions/179915.