Slow convergence of gradient descent for a strictly convex quadratic


Let $0 < \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$ and let $f: \mathbb{R}^n \to \mathbb{R}$ be defined by

$$ f(x) = \frac{1}{2}x^T M x, \qquad M = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}. $$

To use gradient descent with exact minimization, we define $t_k = \frac{\|\nabla f(x_k)\|^2}{\nabla f(x_k)^T M \nabla f(x_k)}$ and $x_{k+1} = x_k - t_k \nabla f(x_k)$.

Given $x_0 = [{\lambda_1}^{-1},0, \ldots ,0, {\lambda_n}^{-1}]^T$, I would like to show that $$ f(x_{k+1}) = \bigg(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\bigg)^2 f(x_k). $$

I have tried both an induction argument and directly expanding $f(x_{k+1})$, but without success. If someone can see how to complete the proof, I would be very interested to see it.
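For concreteness, here is a small NumPy sketch of the iteration (the eigenvalues $(1, 2, 4)$ are an arbitrary illustrative choice), which does reproduce the claimed contraction factor numerically for this starting point:

```python
import numpy as np

def exact_step_gd(lams, x0, iters):
    """Gradient descent with exact line search for f(x) = x^T M x / 2, M = diag(lams)."""
    M = np.diag(np.asarray(lams, dtype=float))
    f = lambda x: 0.5 * x @ M @ x
    x = np.asarray(x0, dtype=float)
    vals = [f(x)]
    for _ in range(iters):
        g = M @ x                   # gradient of f at x
        t = (g @ g) / (g @ M @ g)   # exactly minimizing step length
        x = x - t * g
        vals.append(f(x))
    return vals

lam1, lamn = 1.0, 4.0
vals = exact_step_gd([lam1, 2.0, lamn], [1 / lam1, 0.0, 1 / lamn], 5)
factor = ((lamn - lam1) / (lamn + lam1)) ** 2  # claimed per-step contraction of f
ratios = [b / a for a, b in zip(vals, vals[1:])]
```

Here every ratio $f(x_{k+1})/f(x_k)$ agrees with `factor` to machine precision.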


There are 3 answers below.


Let $M$ be a symmetric positive definite matrix. If $f(x)=\frac{1}{2}x^{\intercal}Mx$, then $\nabla f(x)=Mx$. The gradient descent iteration for $f$ is $$ x_{k+1}\equiv x_{k}-t_{k}\nabla f(x_{k})=\left(I-t_{k}M\right)x_{k}. $$

At each step, we would like to pick $t_{k}$ such that $$ f(x_{k+1})=\frac{1}{2}\left(M^{\frac{1}{2}}\left(I-t_{k}M\right)x_{k}\right)^{\intercal}\left(M^{\frac{1}{2}}\left(I-t_{k}M\right)x_{k}\right) $$ is minimized. Differentiating, $$ \frac{\partial}{\partial t_{k}}\left[f(x_{k+1})\right]=t_{k}x_{k}^{\intercal}M^{3}x_{k}-x_{k}^{\intercal}M^{2}x_{k}. $$ Setting this to zero and solving for $t_{k}$, we get $$ t_{k}=\frac{x_{k}^{\intercal}M^{2}x_{k}}{x_{k}^{\intercal}M^{3}x_{k}}=\frac{\left(Mx_{k}\right)^{\intercal}\left(Mx_{k}\right)}{\left(Mx_{k}\right)^{\intercal}M\left(Mx_{k}\right)}=\frac{\left\Vert \nabla f(x_{k})\right\Vert ^{2}}{\nabla f(x_{k})^{\intercal}M\nabla f(x_{k})}, $$ which is exactly the expression in your original question. Note that the above is well-defined, since $x_{k}^{\intercal}M^{3}x_{k}=(Mx_{k})^{\intercal}M(Mx_{k})>0$ by the positive definiteness of $M$.

Substituting the expression for $t_{k}$ back into $f(x_{k+1})$, $$ f(x_{k+1})=\frac{1}{2}\left(x_{k}^{\intercal}Mx_{k}-\frac{\left(x_{k}^{\intercal}M^{2}x_{k}\right)^{2}}{x_{k}^{\intercal}M^{3}x_{k}}\right)=f(x_{k})-\frac{1}{2}\frac{\left(x_{k}^{\intercal}M^{2}x_{k}\right)^{2}}{x_{k}^{\intercal}M^{3}x_{k}}. $$ Or, more succinctly, $$ \boxed{f(x_{k+1}) = f(x_k) - \frac{\left(f(M^{\frac{1}{2}} x_k)\right)^2}{f(M x_k)}.} $$


Now, take $M=\text{diag}(2,3,5)$ and $x_{0}=(1/2, 1/3, 1/5)^\intercal$. Then, $f(x_{0})=31/60$ and $f(x_{1})=1/15$. However, $$\left(\frac{5-2}{5+2}\right)^{2} \frac{31}{60} \neq \frac{1}{15}.$$ Therefore, I don't think your claim is true. Perhaps you meant something else?
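The counterexample above can be checked in exact rational arithmetic with Python's `fractions` module (one gradient step applied by hand):

```python
from fractions import Fraction

# Counterexample data: M = diag(2, 3, 5), x0 = (1/2, 1/3, 1/5).
lams = [Fraction(2), Fraction(3), Fraction(5)]
x = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)]

def f(v):
    """f(v) = v^T M v / 2 for the diagonal M above."""
    return sum(l * c * c for l, c in zip(lams, v)) / 2

g = [l * c for l, c in zip(lams, x)]  # gradient Mx = (1, 1, 1)
t = sum(gc * gc for gc in g) / sum(l * gc * gc for l, gc in zip(lams, g))
x1 = [c - t * gc for c, gc in zip(x, g)]

f0, f1 = f(x), f(x1)                       # 31/60 and 1/15
claimed = Fraction(5 - 2, 5 + 2) ** 2 * f0  # (3/7)^2 * f(x0), which differs from f1
```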


It's a standard result that $f(x_{k+1}) \leq \left( \frac{\lambda_{n}-\lambda_{1}}{\lambda_{n}+\lambda_{1}} \right)^{2} f(x_{k})$. However, this is an inequality, not an equation. What actually happens depends on exactly where you start the sequence.

The usual proof of this is based on the Kantorovich inequality. You can find it in many textbooks on optimization, e.g. Luenberger and Ye, Linear and Nonlinear Programming, 4th ed., Springer, 2015.
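The inequality is easy to probe numerically. The sketch below (eigenvalues and the number of random trials are arbitrary choices) checks that the Kantorovich-type bound holds at every step from random starting points:

```python
import numpy as np

# Check f(x_{k+1}) <= ((λn - λ1)/(λn + λ1))^2 f(x_k) for random starts.
rng = np.random.default_rng(0)
lams = np.array([1.0, 2.0, 3.0, 7.0])
M = np.diag(lams)
bound = ((lams[-1] - lams[0]) / (lams[-1] + lams[0])) ** 2

f = lambda x: 0.5 * x @ M @ x
ok = True
for _ in range(100):
    x = rng.standard_normal(4)
    for _ in range(20):
        g = M @ x                   # gradient
        t = (g @ g) / (g @ M @ g)   # exact line-search step
        x_new = x - t * g
        ok = ok and f(x_new) <= bound * f(x) + 1e-12
        x = x_new
```

For generic starting points the observed ratio is typically strictly below the bound; equality requires special starts such as the one in the question.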


Your desired claim is true if you choose $$ x_0 = [\lambda_1^{-1}, \mathbf{0, \ldots, 0}, \lambda_n^{-1}]^\top.$$ With this initial condition, $\nabla f(x_0) = (1, 0, \ldots, 0, 1)^\top$ and $t_0 = \frac{2}{\lambda_1 + \lambda_n}$, so a direct computation gives $$ x_1 = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\,[\lambda_1^{-1}, 0, \ldots, 0, -\lambda_n^{-1}]^\top. $$ Since $f$ is unchanged by flipping the sign of the last coordinate, each iterate is a scalar multiple of a vector of the same form, the same computation applies at every step, and the result follows by induction.

By the way, the proof of the upper bound via Kantorovich's lemma also shows that equality holds in this case of interest.
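The induction can be checked in exact arithmetic. The sketch below (the eigenvalues $(1, 2, 4)$ are an illustrative assumption) verifies that each iterate is exactly $c_k\,[\lambda_1^{-1}, 0, \ldots, 0, \pm\lambda_n^{-1}]^\top$ with $c_k = \rho^k$, $\rho = \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}$, and alternating sign on the last coordinate:

```python
from fractions import Fraction

lam = [Fraction(1), Fraction(2), Fraction(4)]
x = [1 / lam[0], Fraction(0), 1 / lam[-1]]
rho = (lam[-1] - lam[0]) / (lam[-1] + lam[0])

sign, c = 1, Fraction(1)
for k in range(5):
    g = [l * xi for l, xi in zip(lam, x)]  # gradient Mx
    t = sum(gi * gi for gi in g) / sum(l * gi * gi for l, gi in zip(lam, g))
    x = [xi - t * gi for xi, gi in zip(x, g)]
    c *= rho
    sign = -sign
    # Each iterate keeps the form c * (1/λ1, 0, ..., 0, ±1/λn).
    assert x == [c / lam[0], Fraction(0), sign * c / lam[-1]]
```

Since $f$ is quadratic, the scaling of $x$ by $\rho$ per step gives exactly the factor $\rho^2$ for $f$, as claimed.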