Prove the steepest descent algorithm for solving $Ax = b$


Prove that the steepest descent algorithm for solving $Ax = b$, where $A$ is symmetric and positive definite, can be rewritten as follows:

  • Compute the residual at the $k^{\text{th}}$ step: $r_k = b − Ax_k$

  • $\alpha_k = \frac{r_k^T r_k}{r_k^T A r_k}$

  • $x_{k+1} = x_k + \alpha_k r_k$
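The three steps above can be sketched in code as follows; the function name, tolerance, and test matrix are illustrative choices, not part of the question:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent for Ax = b, with A symmetric positive definite."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual at the k-th step
        if np.linalg.norm(r) < tol:      # converged: r_k is (numerically) zero
            break
        alpha = (r @ r) / (r @ (A @ r))  # step length alpha_k
        x = x + alpha * r                # x_{k+1} = x_k + alpha_k r_k
    return x

# small s.p.d. example
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent(A, b, np.zeros(2))
```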


To apply steepest descent, you first need an objective function. Below, I give a small motivation for where $$f(\boldsymbol x) := \frac12 \boldsymbol x^T A \boldsymbol x - \boldsymbol b^T \boldsymbol x = \frac12 \langle \boldsymbol x, A \boldsymbol x\rangle - \langle \boldsymbol x, \boldsymbol b \rangle \tag1$$ comes from.

Okay, let's say we want to solve the linear system $$A\boldsymbol x=\boldsymbol b \tag2$$ with a symmetric positive definite (s.p.d.) matrix $A$. From very basic multidimensional optimization, one knows that if the Hessian $\nabla^2 f(\boldsymbol x)$ of the objective function $f(\boldsymbol x)$ is positive definite $\forall \: \boldsymbol x$, the objective function is globally strictly convex and thus has a unique global minimizer. Let's try to leverage the s.p.d. property of $A$: we aim for an objective function such that $$\nabla^2 f(\boldsymbol x) \overset{!}{=} A. \tag3$$ Again from optimization theory for convex functions, we know that we are at the optimum $\boldsymbol x^\star$ if $\nabla f(\boldsymbol x^\star) = \boldsymbol 0$. The optimal $\boldsymbol x^\star$ for our problem would be the one that solves the linear system: $$A\boldsymbol x^\star =\boldsymbol b.\tag4$$ We see that this can be enforced by "integrating/solving the PDE" $\nabla^2 f(\boldsymbol x) = A$ and picking the integration constant in a clever way, namely as $-\boldsymbol b$: $$\nabla f(\boldsymbol x) = A \boldsymbol x - \boldsymbol b. \label{5} \tag5$$ Integrating once more, you can construct the objective function $$f(\boldsymbol x) = \frac12 \boldsymbol x^T A \boldsymbol x - \boldsymbol x^T \boldsymbol b. \tag6$$
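The relation $\nabla f(\boldsymbol x) = A\boldsymbol x - \boldsymbol b$ from $\eqref{5}$ and (6) can be checked numerically with central finite differences; the random s.p.d. matrix and step size below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)            # a random s.p.d. matrix (illustrative)
b = rng.standard_normal(3)

f = lambda x: 0.5 * x @ A @ x - x @ b  # objective from (6)

x = rng.standard_normal(3)
eps = 1e-6
# central finite-difference approximation of the gradient of f at x
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad_fd, A @ x - b, atol=1e-5))  # → True
```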


Now solving the linear system is equivalent to minimizing $f(\boldsymbol x)$. Again, this is just a motivation. Other ways of arriving at $f(\boldsymbol x)$ could stem from experience, or from constructing the simplest sort of convex function involving $A$ and $\boldsymbol b$.

Now continue with optimization-101 material: the first method you learn in numerical optimization is steepest/gradient descent, which, written in slightly more general form, is $$ \boldsymbol x^{k+1} = \boldsymbol x^k + \alpha \boldsymbol p^{k}. \label{7} \tag7$$ In the case of steepest descent, $$\boldsymbol p^k = -\nabla f (\boldsymbol x^k) \overset{\eqref{5}}{=} -A \boldsymbol x^k + \boldsymbol b =:\boldsymbol r^k. \label{8} \tag8 $$

Now, for a general direction $\boldsymbol p^k \neq \boldsymbol 0$, the optimal step length $\alpha$ can be determined by considering $$f\big(\boldsymbol x^{k+1} \big) = f\big(\boldsymbol x^k + \alpha \boldsymbol p^{k}\big) = f\big(\boldsymbol x^k \big) + \alpha \big\langle \boldsymbol p^k , \underbrace{A \boldsymbol x^k -\boldsymbol b}_{=-\boldsymbol r^k} \big\rangle + \frac12 \alpha^2 \big\langle \boldsymbol p^k , A \boldsymbol p^k \big\rangle =: g(\alpha). \label{9}\tag9$$ For fixed $\boldsymbol x^k, \boldsymbol p^k$, this is a scalar, convex, univariate function of $\alpha$ with minimum at $\alpha^\star$ such that $g'(\alpha^\star) = 0$, which is clearly the case when $$\alpha^\star \big(\boldsymbol p^k, \boldsymbol r^k \big) = \frac{ \big \langle \boldsymbol p^k, \boldsymbol r^k \big \rangle }{\big \langle \boldsymbol p^k,A \boldsymbol p^k \big \rangle} \overset{\eqref{8}}{=} \frac{ \big \langle \boldsymbol r^k, \boldsymbol r^k \big \rangle }{\big \langle \boldsymbol r^k,A \boldsymbol r^k \big \rangle}. \label{10} \tag{10}$$

Here, you can safely divide by $\big \langle \boldsymbol r^k,A \boldsymbol r^k \big \rangle$ since $A$ is s.p.d., and it is reasonable to assume that you are not yet at the true solution, i.e., $\boldsymbol r^k \neq \boldsymbol 0$; otherwise you would not perform any update.
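As a quick numerical sanity check of $\eqref{10}$ (the matrix and starting point below are made-up examples): the step length $\alpha^\star$ minimizes $g$, so nearby step lengths along $\boldsymbol r^k$ cannot do better.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # an s.p.d. matrix, chosen for illustration
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - x @ b   # objective from (6)

x = np.zeros(2)                          # current iterate x^k
r = b - A @ x                            # steepest-descent direction p^k = r^k
alpha_star = (r @ r) / (r @ (A @ r))     # formula (10)

g = lambda a: f(x + a * r)               # g(alpha) = f(x^k + alpha r^k)
# alpha_star minimizes g, so slightly shorter or longer steps are no better
assert g(alpha_star) <= min(g(0.9 * alpha_star), g(1.1 * alpha_star))
```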