I am trying to understand the convergence analysis/derivation of the momentum algorithm, or the stochastic heavy ball algorithm, using the regret bound analysis from different research papers.
- https://ieeexplore.ieee.org/document/7330562 - Page3
- https://www.mdpi.com/2504-3110/6/12/709 - Page6
- http://arxiv.org/abs/1707.01647 - Page4
In the derivation, there is the following simplification, which I do not understand at all
$\frac{2\boldsymbol{\eta}_{k}}{(1-\beta)}\sum_{k=0}^{T}\left|J(\theta_k) - J(\theta^*)\right| + \frac{2\boldsymbol{\eta}_{k}\beta}{(1-\beta)^2} \sum_{k=0}^{T}\left|J(\theta_k) - J(\theta_{k-1})\right| \leq \ \left|\boldsymbol{\theta}_{0} + \boldsymbol{p}_{0} - \boldsymbol{\theta}^* \right|^2 - \left|\boldsymbol{\theta}_{T+1} + \boldsymbol{p}_{T+1} - \boldsymbol{\theta}^*\right|^2 \leq \left| \boldsymbol{\theta}_{0} - \boldsymbol{\theta}^* \right|^2$
The term $\left|\boldsymbol{\theta}_{0} + \boldsymbol{p}_{0} - \boldsymbol{\theta}^* \right|^2 - \left|\boldsymbol{\theta}_{T+1} + \boldsymbol{p}_{T+1} - \boldsymbol{\theta}^*\right|^2$ is simplified to $\leq \left| \boldsymbol{\theta}_{0} - \boldsymbol{\theta}^* \right|^2$ directly. I am sure that there is certainly some assumption or some intermediate steps involved and the reader is supposed to know that. Could you please help me understand.
My understanding: In the context of convex optimization, at any iteration the squared distance between the parameter vector $\theta_k$ and the optimal parameter vector $\theta^*$ never increases as the optimization progresses. This infers that the squared distance from the final parameter vector (plus some perturbation $\boldsymbol{p}$) to the optimal parameter vector is less than or equal to the squared distance from the initial parameter vector (plus some perturbation $\boldsymbol{p}$) to the optimal parameter vector, i.e., $\left| \boldsymbol{\theta}_{T+1} + \boldsymbol{p}_{T+1} - \boldsymbol{\theta}^* \right|^2 \leq \left| \boldsymbol{\theta}_{0} + \boldsymbol{p}_{0} - \boldsymbol{\theta}^*\right|^2$. Also, it is assumed that $\boldsymbol{p_0 = 0}$, so $|\boldsymbol{\theta}_{0} + \boldsymbol{p}_{0} - \boldsymbol{\theta}^*|^2 - |\boldsymbol{\theta}_{T+1} + \boldsymbol{p}_{T+1} - \boldsymbol{\theta}^*|^2 \leq | \boldsymbol{\theta}_{0} - \boldsymbol{\theta}^*|^2$