Adam Optimisation Algorithm


In the original paper introducing the Adam optimisation algorithm (link: https://arxiv.org/pdf/1412.6980.pdf), the authors' convergence analysis states that the theorems hold only when the learning rate decays at the rate $t^{-1/2}$, but I am not able to figure out exactly why the theorems fail if the learning rate instead decays at, say, the rate $t^{-1/4}$. What exactly will go wrong? Please help me figure out the logic behind the statement.


If you are referring to Theorem 4.1, it assumes both $\alpha_t = \frac{\alpha}{\sqrt{t}}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ for some $0<\lambda<1$ in order for the upper regret bound to hold. Under these conditions, together with the assumptions of bounded gradients and bounded parameter distances, the regret bound is $O(\sqrt{T})$ (note the capital $T$, the horizon). If you slow the decay from $O(t^{-1/2})$ to $O(t^{-1/4})$, the learning rate becomes asymptotically larger than the theorem allows: any schedule satisfying the hypothesis $\alpha_t \leq \frac{\alpha}{\sqrt{t}}$ obeys $$ \frac{\alpha_t}{t^{-\frac{1}{4}}} = \frac{\alpha_t}{\frac{1}{\sqrt{t}}}\cdot\frac{\frac{1}{\sqrt{t}}}{t^{-\frac{1}{4}}} \;\xrightarrow[t\to\infty]{}\; c \cdot 0 = 0, \quad\text{i.e.}\quad \alpha_t = o\bigg(\frac{1}{t^{\frac{1}{4}}}\bigg), $$ so a schedule $\alpha_t = \alpha\,t^{-\frac{1}{4}}$ violates that hypothesis for every $t>1$. Concretely, the proof controls quantities such as $\sum_{t=1}^T \alpha_t$, which is $O(\sqrt{T})$ under the $t^{-1/2}$ schedule but grows like $T^{3/4}$ under the $t^{-1/4}$ schedule, so the stated $O(\sqrt{T})$ regret guarantee no longer follows.
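To see the growth gap numerically, here is a small sketch (not from the paper; the constants $\alpha=1$ and horizon $T=10^5$ are my choices) comparing the partial sums $\sum_{t\le T} t^{-1/2}$ and $\sum_{t\le T} t^{-1/4}$ that stand in for the accumulated step sizes under the two schedules:

```python
import math

def step_size_sum(T, exponent, alpha=1.0):
    """Partial sum of alpha * t**(-exponent) for t = 1..T."""
    return sum(alpha * t ** (-exponent) for t in range(1, T + 1))

T = 100_000

# Theorem's schedule: alpha_t = alpha / sqrt(t); the sum grows like 2*sqrt(T).
s_half = step_size_sum(T, 0.5)

# Slower decay: alpha_t = alpha * t**(-1/4); the sum grows like (4/3)*T**(3/4).
s_quarter = step_size_sum(T, 0.25)

print(f"sum t**(-1/2) up to T={T}: {s_half:.1f}  (compare 2*sqrt(T) = {2 * math.sqrt(T):.1f})")
print(f"sum t**(-1/4) up to T={T}: {s_quarter:.1f}  (compare (4/3)*T**0.75 = {4 / 3 * T ** 0.75:.1f})")
print(f"ratio to sqrt(T): {s_half / math.sqrt(T):.2f} vs {s_quarter / math.sqrt(T):.2f}")
```

The first ratio stays bounded (around 2) while the second keeps growing with $T$, which is exactly why the $O(\sqrt{T})$ bound cannot survive the slower decay.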