Accelerated Randomized Coordinate Descent


This paper is by a well-known professor and has several hundred citations, so there must be someone who understands it. If anyone experienced in optimization is out there, please have a look.

In Coordinate Descent Algorithms

https://link.springer.com/article/10.1007%2Fs10107-015-0892-3

by Stephen J. Wright, the following assumption is stated (page 14):

Assumption 1 The function $f$ in $\text{ min } f(x) $ is convex and uniformly Lipschitz continuously differentiable, and attains its minimum value $f^{*}$ on a set $S$. There is a finite $R_{0}$ such that the level set of $f$ defined by $x^0$ is bounded, that is, $$ \max _{x^{*} \in \mathcal{S}} \max _{x}\left\{\left\|x-x^{*}\right\|: f(x) \leq f\left(x^{0}\right)\right\} \leq R_{0} $$

Then in the part dealing with accelerated randomized coordinate descent (page 19) it is stated:

Theorem 2: Suppose that Assumption 1 holds, and define $$ S_{0}:=\sup _{x^{*} \in \mathcal{S}} L_{\max }\left\|x^{0}-x^{*}\right\|^{2}+\left(f\left(x^{0}\right)-f^{*}\right) / n^{2}. $$ Then for all $k \ge 0$ we have

$$ \begin{aligned} E(f(x^k)) - f^* &\le S_0 \frac{\sigma}{L_{\max}} \left[ \left( 1+ \frac{\sqrt{\sigma/L_{\max}}}{2n} \right)^{k+1} - \left( 1- \frac{\sqrt{\sigma/L_{\max}}}{2n} \right)^{k+1} \right]^{-2} \\ &\le S_0 \left( \frac{n}{k+1} \right)^2, \end{aligned} $$

where $\sigma$ is the modulus of strong convexity and $L_{\max}$ is the coordinate Lipschitz constant.

Then he concludes the following: The term $$ \left(1+\frac{\sqrt{\sigma / L_{\max }}}{2 n}\right)^{k+1} $$ eventually dominates the second term $$ \left(1-\frac{\sqrt{\sigma / L_{\max }}}{2 n}\right)^{k+1} $$ so that the linear convergence rate suggested by this expression is significantly faster than the corresponding rate $$ E\left[f\left(x^{k}\right)\right]-f^{*} \leq\left(1-\frac{\sigma}{n L_{\max }}\right)^{k}\left(f\left(x^{0}\right)-f^{*}\right) \quad \forall k \geq 1 $$ for Algorithm 3 (which is the randomized coordinate descent without acceleration).

Finally, my problem: I just can't follow his logic and can't see why this expression is significantly faster than the other one.

I am very thankful for any hint.


To see why the convergence rate is faster, compare the two contraction factors: $1-\frac{\sigma}{n L_{\max}}$ and $\Bigl[ \bigl(1+\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^{k+1} - \bigl(1-\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^{k+1}\Bigr]^{-2}$. When $k$ becomes large, the second term inside the brackets becomes negligible, so the latter can be approximated as

$$ \frac{1}{\biggl( \bigl(1+\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^{k+1} - \bigl(1-\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^{k+1} \biggr)^2} \approx \frac{1}{\biggl( \bigl(1+\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^{k+1} \biggr)^2} = \Biggl( \frac{1}{\bigl(1+\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^2} \Biggr)^{k+1}. $$
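This dominance can be checked numerically. A minimal sketch, using illustrative values $n=10$, $\sigma=0.01$, $L_{\max}=1$ (these constants are placeholders for illustration, not taken from the paper):

```python
import math

# Placeholder constants (assumptions, not from the paper): n coordinates,
# strong-convexity modulus sigma, coordinate Lipschitz constant L_max.
n, sigma, L_max = 10, 0.01, 1.0
x = math.sqrt(sigma / L_max) / (2 * n)  # x = sqrt(sigma/L_max)/(2n) = 0.005 here

def exact_factor(k):
    """Bracketed factor from Theorem 2: [(1+x)^(k+1) - (1-x)^(k+1)]^(-2)."""
    return ((1 + x) ** (k + 1) - (1 - x) ** (k + 1)) ** -2

def approx_factor(k):
    """Approximation keeping only the growing term: (1+x)^(-2(k+1))."""
    return (1 + x) ** (-2 * (k + 1))

for k in (100, 1000, 5000):
    print(k, exact_factor(k), approx_factor(k))
```

For moderate $k$ the two still differ noticeably, but as $k$ grows the shrinking term $(1-x)^{k+1}$ vanishes relative to $(1+x)^{k+1}$ and the two factors coincide to high precision.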

Now, using the first-order Taylor expansion $\frac{1}{(1+x)^2} \approx 1-2x$ for small $x$, the rate can be further approximated by

$$ \frac{1}{\bigl(1+\frac{\sqrt{\sigma/L_{\max}}}{2n}\bigr)^2} \approx 1 - \frac{\sqrt{\sigma/L_{\max}}}{n} < 1 - \frac{\sigma/L_{\max}}{n}, $$ where the last inequality holds because $\sigma/L_{\max} < 1$, so $\sqrt{\sigma/L_{\max}} > \sigma/L_{\max}$.
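As a quick sanity check of both the Taylor approximation and the inequality, here is a small numeric sketch (the values of $\sigma$, $L_{\max}$, $n$ are again assumptions for illustration):

```python
import math

# Placeholder constants (assumptions, not from the paper).
sigma, L_max, n = 0.01, 1.0, 10
r = sigma / L_max            # r < 1, so sqrt(r) > r
x = math.sqrt(r) / (2 * n)   # the x appearing in 1/(1+x)^2

exact = 1 / (1 + x) ** 2     # per-iteration factor of the accelerated rate
taylor = 1 - 2 * x           # first-order approximation: 1 - sqrt(r)/n
plain = 1 - r / n            # factor of the unaccelerated rate

print(exact, taylor, plain)  # the first two agree closely; both lie below plain
```

The accelerated factor sits strictly below the unaccelerated one, which is exactly the claimed speedup per iteration.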

The improvement by a factor of $\sqrt{L_{\max}/\sigma}$ in the number of iterations can be seen from the following formula:

$$ (1-\rho)^k < \epsilon \Rightarrow k > \frac{\log (1/\epsilon)}{\log(1/(1-\rho))} \approx \frac{1}{\rho} \log (1/\epsilon) , $$

where you substitute $\rho = \frac{\sigma/L_{\max}}{n}$ for the unaccelerated rate and $\rho = \frac{\sqrt{\sigma/L_{\max}}}{n}$ for the accelerated rate, respectively.
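The factor-of-$\sqrt{L_{\max}/\sigma}$ saving in iteration count can also be verified directly. A sketch with assumed constants (placeholders, not from the paper):

```python
import math

# Placeholder constants and target accuracy (assumptions for illustration).
sigma, L_max, n, eps = 0.01, 1.0, 10, 1e-6

def iters(rho):
    """Approximate iterations needed so that (1 - rho)^k < eps: (1/rho)*log(1/eps)."""
    return math.log(1 / eps) / rho

k_plain = iters(sigma / (n * L_max))           # unaccelerated: rho = (sigma/L_max)/n
k_accel = iters(math.sqrt(sigma / L_max) / n)  # accelerated:   rho = sqrt(sigma/L_max)/n

print(k_plain, k_accel, k_plain / k_accel)     # ratio equals sqrt(L_max/sigma)
```

With $\sigma/L_{\max} = 0.01$ the ratio of iteration counts is $\sqrt{L_{\max}/\sigma} = 10$, independent of the target accuracy $\epsilon$.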