I am studying numerical linear algebra from Trefethen and Bau's book, and I have come across a proof I do not fully understand. I will reproduce the statement and proof as they appear in the book. I apologize in advance for the long post, but I want to present the reasoning exactly as the book gives it.
Let $A$ be an $m \times m$ real, symmetric matrix. Suppose we have $n$ linearly independent vectors $v_1^{(0)},v_2^{(0)},\dots,v_n^{(0)}$, and define $V^{(0)}$ to be the $m \times n$ matrix whose columns are $v_1^{(0)},v_2^{(0)},\dots,v_n^{(0)}$ in this order. Now define $V^{(k)}=A^kV^{(0)}$ (the result of $k$ applications of $A$) and extract a reduced QR factorization $\hat{Q}^{(k)}\hat{R}^{(k)}=V^{(k)}$, where $\hat{Q}^{(k)}$ is an $m \times n$ matrix whose columns form an orthonormal basis for the column space of $V^{(k)}$. We make the following assumption on the eigenvalues of $A$: $|\lambda_1|>|\lambda_2|>\dots>|\lambda_n|>|\lambda_{n+1}| \geq |\lambda_{n+2}| \geq \dots \geq |\lambda_{m}|$. Next, let $Q$ be the matrix whose columns are the normalized eigenvectors of $A$, ordered to match the ordering of the eigenvalues above, and let $\hat{Q}$ be the $m \times n$ matrix whose columns are the first $n$ eigenvectors of $A$, $q_1,\dots,q_n$. Note that $\hat{Q}$ and $\hat{Q}^{(k)}$ are entirely different matrices despite the similar notation. Finally, we assume that all the leading principal submatrices of $\hat{Q}^TV^{(0)}$ are nonsingular.
Now, we have the following theorem:
Assume the notation and assumptions used above. Then as $k \to \infty$, the columns of $\hat{Q}^{(k)}$ converge linearly to the eigenvectors of $A$, so that $$\|q_j^{(k)}- (\pm q_j)\| = O(C^k)$$ for each $j$ with $1 \leq j \leq n$, where $C < 1$ is the constant $\max_{1 \leq k \leq n} |\lambda_{k+1}|/|\lambda_k|$. The $\pm$ sign means that at each step $k$ one or the other choice of sign is to be taken.
Proof: Extend $\hat{Q}$ to a full $m \times m$ orthogonal matrix $Q$ of eigenvectors of $A$ (ordered to match the eigenvalues above), and let $\Lambda$ be the corresponding diagonal matrix of eigenvalues, so that $A=Q\Lambda Q^T$. Just as $\hat{Q}$ is the leading $m \times n$ section of $Q$, define $\hat{\Lambda}$ (still diagonal) to be the leading $n \times n$ section of $\Lambda$. Then we have $$V^{(k)}=A^kV^{(0)}=Q\Lambda^k Q^TV^{(0)}=\hat{Q}\hat{\Lambda}^k\hat{Q}^TV^{(0)}+O(|\lambda_{n+1}|^k)$$ as $k \to \infty$. By our assumption, $\hat{Q}^TV^{(0)}$ in particular is nonsingular. Thus, we can multiply the term $O(|\lambda_{n+1}|^k)$ on the right by $\left(\hat{Q}^TV^{(0)}\right)^{-1}\hat{Q}^TV^{(0)}$ to transform our equation to $$V^{(k)}=\left(\hat{Q}\hat{\Lambda}^k+O(|\lambda_{n+1}|^k)\right)\hat{Q}^TV^{(0)}.$$ Since $\hat{Q}^TV^{(0)}$ is nonsingular, the column space of this matrix is the same as the column space of $$\hat{Q}\hat{\Lambda}^k+O(|\lambda_{n+1}|^k).$$ **From the form of $\hat{Q}\hat{\Lambda}^k$ and the assumption on the ordering of the eigenvalues, it is clear that this column space converges linearly to that of $\hat{Q}$.** This convergence can be quantified, for example, by defining angles between subspaces; we omit the details. Now, in fact, we have assumed not only that $\hat{Q}^TV^{(0)}$ is nonsingular but that all of its leading principal submatrices are nonsingular as well. It follows that the argument above also applies to leading subsets of the columns of $V^{(k)}$ and $\hat{Q}$: the first column, the first and second columns, the first, second, and third columns, and so on. In each case, we conclude that the space spanned by the indicated columns of $V^{(k)}$ converges linearly to the space spanned by the corresponding columns of $\hat{Q}$. From the convergence of the successive column spaces, together with the definition of the $QR$ factorization, the convergence result follows.
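For concreteness, the iteration in the theorem can be run numerically. Here is a minimal NumPy sketch of my own (none of these names come from the book); orthonormalizing at every step yields the same column spaces (and hence the same $\hat{Q}^{(k)}$, up to column signs) as a single QR factorization of $A^kV^{(0)}$, while avoiding overflow:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3

# Build a symmetric A with the assumed eigenvalue separation
# |lambda_1| > ... > |lambda_n| > |lambda_{n+1}| >= ...
lam = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
Q_full, _ = np.linalg.qr(rng.standard_normal((m, m)))  # eigenvectors q_1..q_m
A = Q_full @ np.diag(lam) @ Q_full.T

# Simultaneous iteration: same leading column spaces as QR of A^k V^(0),
# but re-orthonormalized each step for numerical stability.
Qk, _ = np.linalg.qr(rng.standard_normal((m, n)))  # from V^(0)
for _ in range(50):
    Qk, _ = np.linalg.qr(A @ Qk)

# Up to a sign, column j of Q^(k) should match the eigenvector q_j.
errs = [min(np.linalg.norm(Qk[:, j] - Q_full[:, j]),
            np.linalg.norm(Qk[:, j] + Q_full[:, j])) for j in range(n)]
print(errs)
```

After enough steps the errors are tiny, consistent with the $O(C^k)$ rate claimed in the theorem.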
Here is where I get confused:
- How did the authors obtain $$V^{(k)}=A^kV^{(0)}=Q\Lambda^k Q^TV^{(0)}=\hat{Q}\hat{\Lambda}^k\hat{Q}^TV^{(0)}+O(|\lambda_{n+1}|^k)$$, and what is the exact meaning of the big O notation here for matrices?
- I do not understand the statement in boldface. How does the form of $\hat{Q}\hat{\Lambda}^k$ and the assumption on the ordering of the eigenvalues lead to linear convergence?
I would appreciate any and all help in understanding and clarifying these points. I really want to know how the authors conclude linear convergence in the statement in boldface. I thank all helpers.
(1) They're using the fact that $A = Q\Lambda Q^T$. This is the diagonalization of the matrix $A$. The matrix $Q\Lambda^kQ^T$ can also be expressed as
$$ Q\Lambda^kQ^T = \sum_{i=1}^m \lambda_i^k q_iq_i^T$$
The term $\hat{Q}\hat{\Lambda}^k\hat{Q}^T$ represents the first $n$ terms of this sum, and the $O(|\lambda_{n+1}|^k)$ represents the rest of it. The big O notation is saying that the entire rest of the sum is bounded by a constant times $|\lambda_{n+1}|^k$. To be more precise, observe that since each $q_i$ is a unit vector, its entries are at most $1$ in absolute value, and hence so are all the entries of the matrix $q_iq_i^T$.
Thus, using $\mathbf{J}$ to stand for the $m \times m$ matrix of all 1's
$$|\sum_{i=n+1}^m \lambda_i^{k}q_iq_i^T| \leq \sum_{i=n+1}^m |\lambda_i^{k}|\mathbf{J} \leq (m-n)|\lambda_{n+1}|^k\mathbf{J}$$
where a matrix inequality $X \leq Y$ means that every entry of $X$ is less than or equal to the corresponding entry of $Y$ (and the absolute value on the left is taken entrywise). In other words, all the entries of the sum $\sum_{i=n+1}^m \lambda_i^kq_iq_i^T$ are at most $(m-n)|\lambda_{n+1}|^k$ in absolute value. Since this term is bounded by a constant times $|\lambda_{n+1}|^k$, we use big O notation and write
$$\sum_{i=n+1}^m \lambda_i^kq_iq_i^T = O(|\lambda_{n+1}|^k)$$
This sum is then multiplied by $V^{(0)}$, but that is a fixed matrix independent of $k$, so the product is still $O(|\lambda_{n+1}|^k)$.
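The entrywise bound above is easy to check numerically. A small sketch with toy data of my own (the matrices and eigenvalues here are arbitrary, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 5, 2, 8
lam = np.array([4.0, 3.0, 1.5, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))  # orthonormal columns q_i

# Tail of the spectral sum: sum over i = n+1 .. m of lambda_i^k q_i q_i^T.
tail = sum(lam[i] ** k * np.outer(Q[:, i], Q[:, i]) for i in range(n, m))

# Every entry should be at most (m - n) * |lambda_{n+1}|^k in absolute value.
bound = (m - n) * abs(lam[n]) ** k
print(np.abs(tail).max(), "<=", bound)
```

The bound is loose (the comparison with $\mathbf{J}$ gives away a lot), but that does not matter for the big O statement.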
(2) The basic idea is that since $|\lambda_{n+1}|$ is strictly less than $|\lambda_n|$, the contribution of the $O(|\lambda_{n+1}|^k)$ part of the sum becomes insignificant compared to the contribution of the first $n$ terms, just as $1.9^k$ becomes a tiny fraction of $2^k$ as $k$ grows. So, for large values of $k$, the matrix $Q\Lambda^kQ^T$ can be approximated to high accuracy by ignoring the $O(|\lambda_{n+1}|^k)$ part of the sum entirely, or in other words
$$Q\Lambda^kQ^T \approx \hat{Q}\hat{\Lambda}^k\hat{Q}^T$$
The convergence is called "linear" because the error of this approximation is $O(|\lambda_{n+1}/\lambda_n|^k)$: it shrinks by a roughly constant factor at each step, which means it goes to $0$ geometrically (i.e., exponentially fast in $k$). This is in contrast to, say, quadratic convergence, where the error is roughly squared at each step and so drops like $\gamma^{2^k}$ for some $0 < \gamma < 1$, which falls much faster than any geometric sequence.
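To see the "constant factor per step" behavior concretely, here is a toy power-iteration example of my own (a single vector, i.e. the $n=1$ case of the theorem): the error should shrink by roughly a factor of $|\lambda_2/\lambda_1|$ per step.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([2.0, 1.5, 0.5])           # |lambda_2 / lambda_1| = 0.75
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag(lam) @ Q.T
q1 = Q[:, 0]                              # dominant eigenvector

v = rng.standard_normal(3)
errs = []
for _ in range(30):
    v = A @ v
    v /= np.linalg.norm(v)                # normalized A^k v
    errs.append(min(np.linalg.norm(v - q1), np.linalg.norm(v + q1)))

# Once transients die out, consecutive errors shrink by ~|lambda_2/lambda_1|.
ratios = [errs[i + 1] / errs[i] for i in range(20, 25)]
print(ratios)  # each ratio close to 0.75
```

That roughly constant ratio of successive errors is exactly what "linear convergence with constant $C$" means.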