What makes Krylov subspaces so special?


I am reading through "Numerical Linear Algebra" by Lloyd N. Trefethen and David Bau. I have reached the final chapter, about iterative methods. They all share the fact that they use Krylov subspaces.

What I understand so far: we use the subspaces $K_{n}=\langle b, Ab, A^2b, \dots, A^{n-1}b\rangle$ for various problems, e.g. solving linear systems whose dimension $m$ is much larger than $n$. Usually we first orthonormalize a basis of $K_n$, e.g. via the Arnoldi iteration, so the computation is more stable. This way we can save a lot of time and computational cost, and the iterates can converge nicely even for $n \ll m$.

My question is: what makes Krylov subspaces in particular so special? Why doesn't the same idea work well (or does it?) for arbitrary other orthonormal bases that can likewise be expanded dimension by dimension?
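As a concrete illustration of the construction described in the question (this sketch and its names are mine, not from the book): the Krylov matrix can be built column by column and then orthonormalized with a reduced QR factorization.

```python
import numpy as np

def krylov_basis(A, b, n):
    """Build the Krylov matrix K_n = [b, Ab, ..., A^(n-1) b] and
    orthonormalize its columns via a reduced QR factorization."""
    m = len(b)
    K = np.empty((m, n))
    K[:, 0] = b
    for j in range(1, n):
        K[:, j] = A @ K[:, j - 1]   # next power A^j b
    Q, _ = np.linalg.qr(K)          # orthonormal basis of K_n
    return K, Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
K, Q = krylov_basis(A, b, 5)
# Q has orthonormal columns spanning the same subspace as K:
print(np.allclose(Q.T @ Q, np.eye(5)))  # True
```

In practice one orthonormalizes on the fly (Arnoldi, or Lanczos in the symmetric case) rather than forming the powers $A^{j}b$ explicitly, since those columns rapidly align with the dominant eigenvector and the Krylov matrix becomes severely ill-conditioned.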


3 Answers

BEST ANSWER

Strictly speaking, Krylov methods do not always work well. For instance, it is possible to construct $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, with $A$ invertible and $b$ nonzero, such that the first $n-1$ GMRES iterates for $Ax=b$ all have the same residual norm, and the residual only goes to zero at the $n$th step. This situation turns out to be rather pathological, though (i.e. you have to construct the problem from scratch; real-world problems don't look like this).
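One standard way to realize such a stagnating example (my own choice of matrix, not necessarily the construction the answer has in mind) is a cyclic shift: each new Krylov vector is a fresh coordinate direction that tells GMRES nothing about the solution until the subspace closes up at step $n$.

```python
import numpy as np

n = 8
# Cyclic shift matrix: A e_j = e_{j+1} (indices mod n).  A is a
# permutation matrix, hence invertible, and b = e_1 is nonzero.
A = np.roll(np.eye(n), 1, axis=0)
b = np.zeros(n)
b[0] = 1.0

# Norm of the optimal residual over the k-th Krylov space, i.e. what
# full (unrestarted) GMRES achieves at step k:
res_norms = []
for k in range(1, n + 1):
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])
    y, *_ = np.linalg.lstsq(A @ K, b, rcond=None)
    res_norms.append(np.linalg.norm(b - A @ K @ y))

print(res_norms)  # stays at 1.0 until the very last step
```

Here $A K_k$ has columns $e_2, \dots, e_{k+1}$, all orthogonal to $b = e_1$, so the best residual stays at $\|b\| = 1$ for every $k < n$ and drops to zero only at $k = n$.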

Getting back to more realistic situations, you can use other bases as you see fit. But by directly studying Krylov methods, which are mathematically convenient, you can find out when they perform well (by proving convergence estimates), and then split a general problem into "convert the problem into one where Krylov methods perform well" and "apply a Krylov method". That first step is called preconditioning; there are lots of preconditioners for Krylov methods out there.
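As a toy illustration of that first step (the test matrix and the choice of a Jacobi preconditioner are my own, not from the answer): preconditioning replaces $Ax=b$ by an equivalent system whose matrix is much better conditioned, which is a standard proxy for "Krylov methods perform well".

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# SPD test matrix with wildly varying diagonal scales -> ill-conditioned.
d = 10.0 ** rng.uniform(-3, 3, n)
B = rng.standard_normal((n, n))
A = B @ B.T / n + np.diag(d)

# Jacobi (diagonal) preconditioning: symmetrically scale by diag(A)^(-1/2),
# which preserves symmetry and gives the scaled matrix a unit diagonal.
Minv_half = np.diag(1.0 / np.sqrt(np.diag(A)))
A_prec = Minv_half @ A @ Minv_half

print(np.linalg.cond(A), np.linalg.cond(A_prec))
```

The condition number of the scaled system is orders of magnitude smaller, and convergence bounds for CG and GMRES improve accordingly; realistic preconditioners (ILU, multigrid, domain decomposition) follow the same pattern with more sophisticated choices of $M$.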

ANSWER

A good way to motivate the use of Krylov subspace methods is the following idea, which is closely related to a derivation of the conjugate gradient method. $\newcommand{\dotp}[2]{\langle #1,#2 \rangle}$

Assume we have an initial guess $x_0$ for the solution $x$ of the linear system $$ A x = b \,, $$ where $A$ is a symmetric, positive definite matrix. Let us denote the error by $e_0$, $$ e_0 := x - x_0 \,, $$ and define the residual $$ r_0 := b - A x_0 \,. $$

We now want to make a correction step of the form $$ x_1 = x_0 + p \,, $$ to improve our approximation, where $p$ is some vector that we still need to determine. The error $e_1$ after the correction is $$ e_1 = x - x_1 = x - (x_0 + p) = e_0 - p \,. $$ We take a look at the $A$-norm of the error; we have that \begin{align*} \| e_0 - p \|_A^2 &= \dotp{A e_0 - A p}{e_0 - p} \\ &= \dotp{A e_0}{e_0} - \dotp{A e_0}{p} - \dotp{A p}{e_0} + \dotp{A p}{p} \\ &= \dotp{A e_0}{e_0} - 2 \dotp{Ae_0}{p} + \dotp{Ap}{p} \\ &= c - 2\dotp{r_0}{p} + \dotp{Ap}{p} \,, \end{align*} where we have used that $A e_0 = r_0$ and where $c$ is some constant independent of $p$. Let us define the function $\phi$ by setting $$ \phi(p) = \dotp{Ap}{p} - 2 \dotp{r_0}{p} + c \,. $$ The function $\phi$ is just the squared $A$-norm of the error $e_1$ as a function of the correction $p$. Hence, we want to choose $p$ such that $\phi(p)$ is small. For this purpose, we compute the gradient of $\phi$.
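The identity $\phi(p) = \|e_0 - p\|_A^2$ can be sanity-checked numerically; the following sketch (my own, with an arbitrary SPD test matrix) just follows the definitions above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)        # symmetric positive definite
x = rng.standard_normal(n)         # "true" solution
b = A @ x
x0 = rng.standard_normal(n)        # initial guess
e0, r0 = x - x0, b - A @ x0
c = e0 @ A @ e0                    # the constant <A e0, e0>

def phi(p):
    """phi(p) = <Ap, p> - 2 <r0, p> + c, as in the derivation."""
    return p @ A @ p - 2 * (r0 @ p) + c

p = rng.standard_normal(n)
lhs = (e0 - p) @ A @ (e0 - p)      # ||e0 - p||_A^2
print(np.isclose(phi(p), lhs))     # True: the two expressions agree
```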

A small computation which uses the symmetry of $A$ gives that $$ \nabla \phi(p) = 2(Ap - r_0) \,. $$ Recall that the negative of the gradient always points in the direction in which the function decreases the most. At position $x_0$, where $p=0$, we have that $$ \nabla \phi(0) = -2r_0 \,. $$ Thus, the function $\phi$ decreases the most in the direction of the residual $r_0$. This direction is also called the direction of steepest descent.

For this reason, assume that we choose $p = \alpha_0 r_0$ for some scalar $\alpha_0$. In the next step, we again choose a vector $p$ and add it to the current iterate. Using the same argument, we choose $\alpha_1 r_1$ for some scalar $\alpha_1$. We continue in this manner. Thus, we have that $$ x_{k+1} = x_0 + \alpha_0 r_0 + \alpha_1 r_1 + \cdots + \alpha_k r_k \,, $$ in other words that $$ x_{k+1} \in x_0 + \mbox{span}\{ r_0, r_1, \dots, r_k \} \,. $$

Now, using the relation that $$ r_{k+1} = b - A x_{k+1} = b - A (x_k + \alpha_k r_k) = r_k - \alpha_k A r_k \,, $$ we see that $$ \mbox{span}\{ r_0, r_1, \dots, r_k \} = \mbox{span}\{ r_0, A r_0, \dots, A^{k} r_0 \} \,. $$ Hence, a Krylov subspace is the space spanned by the vectors of successive directions of steepest descent.
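This conclusion can be checked numerically by running plain steepest descent with exact line search and comparing the span of its residuals with the Krylov matrix (a sketch under my own choice of test matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)    # symmetric positive definite
b = rng.standard_normal(n)
x = np.zeros(n)                # x0 = 0, so r0 = b

residuals = []
for k in range(4):
    r = b - A @ x
    residuals.append(r)
    alpha = (r @ r) / (r @ A @ r)   # exact line search minimizing phi
    x = x + alpha * r

R = np.column_stack(residuals)
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(4)])
# Same column space: stacking R next to K does not increase the rank.
print(np.linalg.matrix_rank(np.hstack([R, K])) == np.linalg.matrix_rank(K))
```

The step size $\alpha_k = \langle r_k, r_k\rangle / \langle r_k, A r_k\rangle$ is exactly the minimizer of $\phi$ along the direction $r_k$; conjugate gradients improves on this by choosing $A$-orthogonal search directions within the same Krylov space.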

If you want to read more about a derivation of the conjugate gradient method which uses the above mentioned minimization principle, see An Introduction to the Conjugate Gradient Method Without the Agonizing Pain by J. R. Shewchuk.

ANSWER

While the other answers are good, they somewhat ignore the point of how Krylov methods actually work computationally. If you take for instance the Arnoldi method, we get something like the following.

$$ A = QHQ^{*} $$

Then we let $Q_{n} \in \mathbb{C}^{m \times n}$ be the matrix consisting of the first $n$ columns of $Q$.

We then let $\tilde{H}_{n} \in \mathbb{C}^{(n+1) \times n}$ be the upper-left section of $H$, which is again a Hessenberg matrix. Then we get $$ AQ_{n} = Q_{n+1}\tilde{H}_{n} \,. $$ The matrices $Q_{n}$ generated by the Arnoldi iteration are reduced QR factors of the Krylov matrices, that is, $$ K_{n} = Q_{n}R_{n} \,, $$ with each Hessenberg matrix $H_{n}$ being the corresponding projection of $A$ onto the Krylov subspace.

After some work, this gives us the projection $$ H_{n} = Q_{n}^{*}AQ_{n} \,. $$

This means that although we may have a very large matrix $A \in \mathbb{C}^{m \times m}$, in some cases a small number of Arnoldi vectors suffices to construct a good approximation to $A$. This happens to be very useful in some types of approximations for partial differential equations, where the matrices are million-by-million or larger and standard dense methods would take many years.
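The Arnoldi relations quoted above can be verified directly with a minimal implementation (a sketch, not production code; breakdown handling and reorthogonalization are omitted):

```python
import numpy as np

def arnoldi(A, b, n):
    """n steps of Arnoldi: returns Q (m x (n+1)) with orthonormal columns
    spanning K_{n+1}, and the (n+1) x n Hessenberg matrix H_tilde."""
    m = len(b)
    Q = np.zeros((m, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(n):
        v = A @ Q[:, j]
        for i in range(j + 1):          # Gram-Schmidt against earlier q_i
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        Q[:, j + 1] = v / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(4)
m, n = 40, 6
A = rng.standard_normal((m, m))
b = rng.standard_normal(m)
Q, H = arnoldi(A, b, n)
Qn = Q[:, :n]
# The two identities from the answer:
print(np.allclose(A @ Qn, Q @ H))            # A Q_n = Q_{n+1} H~_n
print(np.allclose(Qn.T @ A @ Qn, H[:n, :]))  # H_n = Q_n^* A Q_n
```

The second identity is what makes the method useful: the small Hessenberg matrix $H_n$ is the compression of $A$ onto the Krylov subspace, and its eigenvalues (Ritz values) often approximate extreme eigenvalues of $A$ well for $n \ll m$.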