Derivation of the upper bound of $\sum_{t}\sqrt{\frac{g^{2}_{t,i}}{t}}$ in Adam optimizer's regret bound


Source: https://arxiv.org/pdf/1412.6980.pdf

This result (Lemma $10.3$, page $12$) is used later to derive the regret bound $R(T)$. Here $t = 1, \ldots, T$ are the iterations of the algorithm, $g_{t,i} = \frac{\partial L}{\partial \theta_i}$ is the gradient of the loss function at step $t$ with respect to the weight $\theta_i$, and $\|g_t\|_2 = \sqrt{\sum_{i}g^2_{t,i}}$ is the norm of the full gradient at time $t$. The gradients are assumed to be bounded: $\|g_t\|_2 \leq G$, $\|g_t\|_{\infty} \leq G_{\infty}$ (page $4$).

In Lemma $10.3$ it is shown that $$ S_{T} = \sum_{t=1}^{T}\sqrt{\frac{g^2_{t,i}}{t}} \leq 2 G_{\infty}\|g_{1:T,i}\|_2. $$
The norm on the right is taken over all gradients of weight $i$ across the $T$ iterations, $\|g_{1:T,i}\|_2 = \sqrt{\sum_{t=1}^{T}g^{2}_{t,i}}$. In the paper the lemma is proven by induction. Given the form of the expression on the LHS, I think there should be a more direct way of proving this inequality.

The first thing I thought of is the Cauchy–Schwarz inequality, applied to the finite sequences $\phi_1(t) = \frac{1}{\sqrt{t}}$ and $\phi_2(t) = |g_{t,i}|$: $$ S_{T} \leq \sqrt{\sum_{t=1}^{T}\frac{1}{t}} \cdot \sqrt{\sum_{t=1}^{T}g^2_{t,i}} = \sqrt{H_{T}}\,\|g_{1:T,i}\|_2, $$ where $H_T$ is the $T$-th harmonic number, $H_T \approx \log T$. This factor grows with $T$, so the bound is worse than the one proven in the article, whose prefactor is the constant $2G_{\infty}$, with $G_{\infty} \geq \max_{t}\max_{i}|g_{t,i}|$.

I also tried completing the square, i.e. writing the summand as $$ h(t) = \bigg(\frac{1}{\sqrt{2t}} + \frac{|g_{t,i}|}{\sqrt{2}}\bigg)^2 = \frac{1}{2t} + \frac{g^{2}_{t,i}}{2} + \sqrt{\frac{g^{2}_{t,i}}{t}}, $$ then summing the terms on the RHS and trying to show that they stay below some upper bound. I also understand that $|g_{t,i}| \leq \max_{i} |g_{t,i}| = \|g_t\|_{\infty} \leq G_{\infty}$ for all $i$, but somehow I can't put these pieces together into a nice proof.
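For completeness, here is the inductive step from the paper's proof of Lemma $10.3$, transcribed into the notation above (the base case $T=1$ is stated there as $\sqrt{g^2_{1,i}} \leq 2G_{\infty}\|g_{1,i}\|_2$):

$$
\begin{aligned}
\sum_{t=1}^{T}\sqrt{\frac{g^2_{t,i}}{t}}
&= \sum_{t=1}^{T-1}\sqrt{\frac{g^2_{t,i}}{t}} + \sqrt{\frac{g^2_{T,i}}{T}} \\
&\leq 2G_{\infty}\|g_{1:T-1,i}\|_2 + \sqrt{\frac{g^2_{T,i}}{T}} \\
&= 2G_{\infty}\sqrt{\|g_{1:T,i}\|_2^2 - g^2_{T,i}} + \sqrt{\frac{g^2_{T,i}}{T}}.
\end{aligned}
$$

From $\|g_{1:T,i}\|_2^2 - g^2_{T,i} + \frac{g^4_{T,i}}{4\|g_{1:T,i}\|_2^2} \geq \|g_{1:T,i}\|_2^2 - g^2_{T,i}$, taking square roots of both sides gives

$$
\sqrt{\|g_{1:T,i}\|_2^2 - g^2_{T,i}} \leq \|g_{1:T,i}\|_2 - \frac{g^2_{T,i}}{2\|g_{1:T,i}\|_2} \leq \|g_{1:T,i}\|_2 - \frac{g^2_{T,i}}{2\sqrt{T G^2_{\infty}}},
$$

where the last step uses $\|g_{1:T,i}\|_2 \leq \sqrt{T G^2_{\infty}}$. Substituting this back, the paper concludes

$$
2G_{\infty}\sqrt{\|g_{1:T,i}\|_2^2 - g^2_{T,i}} + \sqrt{\frac{g^2_{T,i}}{T}} \leq 2G_{\infty}\|g_{1:T,i}\|_2.
$$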
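As a quick numerical sanity check (a throwaway Python sketch; the random gradient trajectory is made up, not from the paper), one can compare $S_T$ against both the Cauchy–Schwarz bound $\sqrt{H_T}\,\|g_{1:T,i}\|_2$ and the lemma's bound $2G_{\infty}\|g_{1:T,i}\|_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
G_inf = 1.0  # bound on |g_{t,i}|; gradients drawn uniformly from [-G_inf, G_inf]
g = rng.uniform(-G_inf, G_inf, size=T)  # hypothetical gradient trajectory for one weight i

t = np.arange(1, T + 1)
S_T = np.sum(np.abs(g) / np.sqrt(t))          # sum_t sqrt(g_{t,i}^2 / t)
g_norm = np.linalg.norm(g)                    # ||g_{1:T,i}||_2
cs_bound = np.sqrt(np.sum(1.0 / t)) * g_norm  # sqrt(H_T) * ||g_{1:T,i}||_2
lemma_bound = 2 * G_inf * g_norm              # 2 * G_inf * ||g_{1:T,i}||_2

print(f"S_T                  = {S_T:.2f}")
print(f"Cauchy-Schwarz bound = {cs_bound:.2f}")
print(f"Lemma 10.3 bound     = {lemma_bound:.2f}")
```

With such draws the constant bound $2G_{\infty}\|g_{1:T,i}\|_2$ comes out noticeably tighter than $\sqrt{H_T}\,\|g_{1:T,i}\|_2$, matching the comparison above.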