The upper bound of the effective stepsize of the Adam optimizer


In the paper introducing the Adam optimizer, the authors state in Section 2.1 that the effective stepsize has two upper bounds: $\alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}$ in the case $1 - \beta_1 > \sqrt{1 - \beta_2}$, and $\alpha$ otherwise. How can we prove this?
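
To make the claim concrete, here is a minimal numerical sketch (my own, not from the paper) of the quantity being bounded, $|\hat{m}_t/\sqrt{\hat{v}_t}|$, with $\alpha$ factored out and $\epsilon$ ignored; the function name and the random test are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_step_ratio(g, beta1, beta2):
    """Run Adam's moment recursions on the gradient sequence g and
    return |m_hat / sqrt(v_hat)| at the final step (alpha factored out,
    epsilon ignored)."""
    m = v = 0.0
    for t, gt in enumerate(g, start=1):
        m = beta1 * m + (1 - beta1) * gt       # first-moment estimate
        v = beta2 * v + (1 - beta2) * gt**2    # second-moment estimate
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    return abs(m_hat) / np.sqrt(v_hat)

beta1, beta2 = 0.9, 0.999  # the paper's default values
# Here 1 - beta1 > sqrt(1 - beta2), so the claimed bound is the first one:
bound = (1 - beta1) / np.sqrt(1 - beta2)
ratios = [effective_step_ratio(rng.normal(size=20), beta1, beta2)
          for _ in range(1000)]
print(f"max observed ratio: {max(ratios):.4f}, claimed bound: {bound:.4f}")
```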

1 Answer

Glad someone is looking at this too. First, we can write $\hat{m}_t$ and $\hat{v}_t$ in closed form as follows:

$$\hat{m}_t = \frac{1 - \beta_1}{1 - \beta_1^t} \sum_{j=1}^t \beta_1^{t - j} g_j, \qquad \hat{v}_t = \frac{1 - \beta_2}{1 - \beta_2^t} \sum_{j=1}^t \beta_2^{t - j} g_j^2.$$
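
As a sanity check (mine, not part of the original argument), these closed forms can be verified against the paper's recursive definitions $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ with bias correction:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=10)
beta1, beta2, t = 0.9, 0.999, len(g)

# Recursive definitions from the paper, with bias correction.
m = v = 0.0
for gt in g:
    m = beta1 * m + (1 - beta1) * gt
    v = beta2 * v + (1 - beta2) * gt**2
m_hat_rec, v_hat_rec = m / (1 - beta1**t), v / (1 - beta2**t)

# Closed-form sums above.
j = np.arange(1, t + 1)
m_hat_sum = (1 - beta1) / (1 - beta1**t) * np.sum(beta1**(t - j) * g)
v_hat_sum = (1 - beta2) / (1 - beta2**t) * np.sum(beta2**(t - j) * g**2)

assert np.isclose(m_hat_rec, m_hat_sum) and np.isclose(v_hat_rec, v_hat_sum)
```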

If $1 - \beta_1 > \sqrt{1-\beta_2}$, then squaring gives $1 - 2 \beta_1 + \beta_1^2 > 1 - \beta_2$, i.e. $\beta_2 > 2 \beta_1 - \beta_1^2 = \beta_1 (2 - \beta_1) \geq \beta_1$, where the last step uses $\beta_1 \in [0, 1]$. Hence $\beta_1 < \beta_2$.
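
A quick grid check of this implication (again mine, just a sanity check) over $\beta_1, \beta_2 \in [0, 1)$:

```python
import numpy as np

b1, b2 = np.meshgrid(np.linspace(0.0, 0.999, 400), np.linspace(0.0, 0.999, 400))
case = (1 - b1) > np.sqrt(1 - b2)   # the case of the first bound
assert np.all(b1[case] < b2[case])  # this case indeed forces beta1 < beta2
```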

To get the first bound, we just need to show $$\frac{\left|\sum_{j=1}^t \beta_1^{t - j} g_j\right|}{1 - \beta_1^t} \leq \sqrt{\frac{\sum_{j=1}^t \beta_2^{t - j} g_j^2}{1 - \beta_2^t}},$$ since the remaining factor $(1 - \beta_1)/\sqrt{1 - \beta_2}$ in $|\hat{m}_t|/\sqrt{\hat{v}_t}$ is exactly the claimed bound. Without the square root on the right-hand side, the inequality above is clear from the relation $\beta_1 < \beta_2$. With the square root, it would also be clear if the right-hand side were $\leq 1$. And we may assume it is: replacing every $g_j$ by $c\, g_j$ with $c > 0$ scales both sides linearly in $c$ (the $g_j^2$ scale quadratically, but the square root brings this back to linear), so we can scale the $g_j$'s down until the right-hand side is indeed $\leq 1$ without changing the conclusion.
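
Here is a numerical spot check of the reduced inequality (a check over random gradients, not a proof) at the paper's default $\beta_1 = 0.9$, $\beta_2 = 0.999$, which satisfy $1 - \beta_1 > \sqrt{1 - \beta_2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
beta1, beta2 = 0.9, 0.999
assert (1 - beta1) > np.sqrt(1 - beta2)   # first-bound case

for _ in range(1000):
    t = int(rng.integers(1, 50))
    g = rng.normal(size=t)
    j = np.arange(1, t + 1)
    lhs = abs(np.sum(beta1**(t - j) * g)) / (1 - beta1**t)
    rhs = np.sqrt(np.sum(beta2**(t - j) * g**2) / (1 - beta2**t))
    assert lhs <= rhs + 1e-12
```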

I think the second bound is actually false. Consider $t = 2$ with $g_1 \neq 0$ and $g_2 = 0$; then the assertion reduces to showing that $1 - \beta_1 \leq \sqrt{1-\beta_2}$ implies

$$ \frac{\beta_1}{1 + \beta_1} \leq \sqrt{\frac{\beta_2}{1 + \beta_2}}.$$

But we can take $\beta_1 = 1$ and $\beta_2 = 0$, which satisfy the condition, while the inequality above clearly fails. (Strictly speaking the paper restricts $\beta_1, \beta_2 \in [0, 1)$, but the inequality also fails for values inside that range, e.g. $\beta_1 = 0.9$ and $\beta_2 = 0.1$: the left-hand side is $0.9/1.9 \approx 0.474$, the right-hand side $\sqrt{0.1/1.1} \approx 0.302$, and the condition holds since $0.1 \leq \sqrt{0.9}$.)
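
The counterexample is easy to confirm numerically; the sketch below (mine) checks both the degenerate pair and the interior pair from above, using a tiny positive stand-in for $\beta_2 = 0$ to avoid dividing by zero:

```python
import numpy as np

def ratio(beta1, beta2):
    """|m_hat / sqrt(v_hat)| at t = 2 for g_1 = 1, g_2 = 0 (closed form)."""
    return (beta1 / (1 + beta1)) / np.sqrt(beta2 / (1 + beta2))

# (1.0, 1e-12) approximates the degenerate pair; (0.9, 0.1) is interior.
for beta1, beta2 in [(1.0, 1e-12), (0.9, 0.1)]:
    assert 1 - beta1 <= np.sqrt(1 - beta2)   # the "otherwise" case holds
    print(beta1, beta2, ratio(beta1, beta2)) # both ratios exceed 1
```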