(steepest) gradient descent for minimizing a quadratic function $\langle x, Ax \rangle$ with $A \succeq 0$


Suppose $f(x) = \langle x, Ax \rangle + \langle b, x \rangle$ where $A$ is positive semidefinite and $x \in \mathbb{R}^n$. Let

$$0 = \lambda_1(A) \le \dots \le \lambda_n(A)$$

be the ordered eigenvalues, and let $v$ be a nonzero eigenvector associated with the eigenvalue $0$. We know the function is convex and has $\lambda_n$-Lipschitz gradient, that is,

$$\|\nabla f(x) - \nabla f(y)\| \le \lambda_n \|x-y\|$$

since $\nabla^2 f(x) = A \succeq 0$ and $\|\nabla^2 f(x)\| = \lambda_n$. Convexity also implies that every critical point of $f$ is a global minimizer. It is known that (steepest) gradient descent for a function with $L$-Lipschitz gradient, starting from $x_0 \neq 0$ and $x_0 \neq v$, generates iterates by the following rule \begin{align*} x_{k+1} = x_k - \frac{1}{\lambda_n} (Ax_{k}+b). \end{align*} Suppose $x^*$ is a critical point, i.e., $Ax^*+b = 0$.
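For intuition, here is a small numerical experiment with this iteration (a hypothetical $3\times 3$ instance; I assume the convention $\nabla f(x) = Ax+b$, and I pick $b \in \operatorname{range}(A)$ so that critical points exist):

```python
import numpy as np

# Toy singular PSD instance (assumption: gradient is A x + b, b in range(A)).
A = np.diag([0.0, 1.0, 2.0])      # eigenvalues 0 = lam_1 <= lam_2 <= lam_n
b = np.array([0.0, -1.0, -2.0])   # in range(A); critical points are (t, 1, 1)
lam_n = 2.0

x = np.array([0.5, 3.0, -2.0])    # x_0 with a nonzero kernel component (0.5)
for _ in range(100):
    x = x - (1.0 / lam_n) * (A @ x + b)

print(x)                          # the kernel coordinate of x_0 never moves
```

On this instance the iterates converge to $(0.5, 1, 1)$: the component of $x_0$ along the kernel direction is left untouched by every step.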

My questions are:

  1. Where does the sequence converge to: $x^*$, or $x^* + \alpha v$ for some scalar $\alpha$?

  2. Suppose now we only care about convergence to critical points of $f$. Can we use the following update rule \begin{align*} x_{k+1} = x_k - \frac{2}{\lambda_2 + \lambda_n} (Ax_k+b)? \end{align*} The step size is chosen by viewing the function as $\lambda_2$-strongly convex on the subspace $v^{\perp}$. Does this update rule make sense? If so, how can one argue this rigorously?
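To see the effect of the larger step numerically (again a hypothetical instance, assuming the convention $\nabla f(x)=Ax+b$ with $b \in \operatorname{range}(A)$), one can compare the per-iteration error contraction of the two step sizes:

```python
import numpy as np

# Toy singular PSD instance (assumption: gradient is A x + b, b in range(A)).
A = np.diag([0.0, 1.0, 2.0])          # lam_1 = 0, lam_2 = 1, lam_n = 2
b = np.array([0.0, -1.0, -2.0])
lam2, lam_n = 1.0, 2.0
x_star = np.array([0.5, 1.0, 1.0])    # the critical point sharing the kernel
                                      # component of x0 below

def run(step, x0, iters=25):
    """Run gradient descent; return the final error contraction factor."""
    x = x0.copy()
    rate = 1.0
    for _ in range(iters):
        e_old = np.linalg.norm(x - x_star)
        x = x - step * (A @ x + b)
        rate = np.linalg.norm(x - x_star) / e_old
    return rate

x0 = np.array([0.5, 3.0, -2.0])
r_slow = run(1.0 / lam_n, x0)           # classical step 1/lam_n
r_fast = run(2.0 / (lam2 + lam_n), x0)  # proposed step 2/(lam2 + lam_n)
print(r_slow, r_fast)
```

On this instance both rules converge to the same point, but the contraction factor on $v^\perp$ improves from $1 - \lambda_2/\lambda_n = 0.5$ to $(\lambda_n-\lambda_2)/(\lambda_n+\lambda_2) = 1/3$.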

This problem is treated in most optimization books, but in everything I have read the assumption is always $A \succ 0$. Thanks in advance.


Accepted answer:

The answer to your question is yes, assuming that $\lambda_2>0$. (There appear to be some typos in the original post; I will work with $f(x) = \tfrac{1}{2}\langle x,Ax\rangle + \langle b,x\rangle$, which has $\nabla f(x) = Ax+b$. This gradient is $\|A\|$-Lipschitz, where $\|A\|=\lambda_n$ is the largest eigenvalue of $A$.) I also set $C := \{x \mid Ax+b = 0\}$ and assume that $C$ is nonempty.

In the following, [BC] is Bauschke and Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces (Springer, second edition, 2017), and [BLM] is Bauschke, Lukens, and Moursi, "Affine Nonexpansive Operators, Attouch–Théra Duality and the Douglas–Rachford Algorithm", which appeared in Set-Valued and Variational Analysis 25(3), 481–505.

1) Note that $\nabla f$ is $\lambda_n$-Lipschitz continuous. Thus, by [BC, Theorem 18.5], $\frac{1}{\lambda_n} \nabla f$ is of the form identity minus a proximal mapping.

2) In turn, by [BC, Proposition 12.28], $$T := \textrm{Id}-\tfrac{1}{\lambda_n}\nabla f$$ is a proximal mapping as well, and in particular firmly nonexpansive.

3) Now let $0<\mu<2$. Then, as a relaxation of the firmly nonexpansive $T$, $$T_\mu := (1-\mu)\textrm{Id}+\mu T = \textrm{Id} - \tfrac{\mu}{\lambda_n}\nabla f \;\;\text{is averaged.}$$

4) Hence $T_\mu$ is averaged and affine, with linear part $L_\mu := \textrm{Id}-\tfrac{\mu}{\lambda_n}A$, which is itself averaged and linear.

5) By [BC, Example 5.29], the iterates of $L_\mu$ converge pointwise to the projection of the starting point onto the kernel of $A$.
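A quick numerical illustration of 5), on a hypothetical $3\times 3$ rotated instance with $\mu = 1$:

```python
import numpy as np

# A singular PSD matrix with a rotated kernel: A = Q diag(0, 1, 2) Q^T,
# so ker A = span{Q e_1}.
c, s = 0.6, 0.8                       # cos/sin of a rotation in coords 1-2
Q = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])
A = Q @ np.diag([0.0, 1.0, 2.0]) @ Q.T
lam_n, mu = 2.0, 1.0

L = np.eye(3) - (mu / lam_n) * A      # linear part L_mu = Id - (mu/lam_n) A

v = Q[:, 0]                           # unit vector spanning ker A
P = np.outer(v, v)                    # orthogonal projection onto ker A

x0 = np.array([1.0, -2.0, 3.0])
x = x0.copy()
for _ in range(200):                  # iterate L_mu on x0
    x = L @ x

print(np.allclose(x, P @ x0))         # limit is the projection of x0 onto ker A
```

The eigenvalues of $L_\mu$ here are $1$, $1/2$, and $0$, so the powers of $L_\mu$ kill everything outside $\ker A$ and fix the kernel component, exactly as 5) predicts.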

6) By [BLM, Corollary 2.8] the convergence in 5) is linear.

7) By [BLM, Theorem 3.5], the convergence of the iterates of $T_\mu$ is also linear, and the limit is $P_C(x_0)$, the projection of the starting point onto $C$.

8) The case you care about arises with $$\mu := \frac{2\lambda_n}{\lambda_2+\lambda_n} < 2,$$ which holds provided that $\lambda_2>0$.
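Indeed, with this choice of $\mu$ the relaxed map $T_\mu$ is precisely the update rule from question 2:

$$T_\mu = \textrm{Id} - \frac{\mu}{\lambda_n}\nabla f = \textrm{Id} - \frac{2}{\lambda_2+\lambda_n}\nabla f, \qquad \text{i.e.,} \qquad x_{k+1} = x_k - \frac{2}{\lambda_2+\lambda_n}(Ax_k+b).$$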

9) If you want, you can also use $$\mu := \frac{2\lambda_n}{\lambda_{n-1}+\lambda_n} < 2$$ and only assume that $\lambda_{n-1}>0$.