I am trying to see if I understand the GMRES method and its result, but somewhere I get confused and I wonder if I am making a mistake. We start with a system $Ax=b$. We look for approximate solutions not in the whole space $\Bbb R^n$ but in the subspace $\mathcal{K}_m =\operatorname{span}\{b,Ab,\ldots ,A^{m-1}b \}$, where $m$ is typically much smaller than $n$. The approximation we choose is the vector $x_m$ that minimizes the 2-norm of the residual: $$x_m = \operatorname{argmin}\limits_{y\in \mathcal{K}_m} \| b-Ay \|_2.$$ After some manipulation we find that this is equivalent to $$x_m = \operatorname{argmin}\limits_{y\in \mathcal{K}_m} \left\{ -2y^TA^Tb+y^TA^TAy \right\}.$$ Then we recognize that via a Gram-Schmidt-like procedure we can find an orthonormal basis for $\mathcal{K}_m$, collected in a matrix $Q_m$ that satisfies $AQ_m = Q_{m+1}H_{m+1,m}$, where $H_{m+1,m}$ is an $(m+1)\times m$ Hessenberg matrix. Now we may write $x_m= Q_m\bar{x}$ and we can find our approximation via $$\bar{x} = \operatorname{argmin}\limits_{y\in \Bbb R^m} \left\{ -2y^TQ_m^TA^Tb+y^TQ_m^TA^TAQ_my \right\}=\operatorname{argmin}\limits_{y\in \Bbb R^m} \left\{ -2y^TQ_m^TA^TQ_mQ_m^Tb+y^TQ_m^TA^TQ_mQ_m^TAQ_my \right\}=\operatorname{argmin}\limits_{y\in \Bbb R^m} \left\{ -2y^TH_n^T\bar{b}+y^TH_n^TH_ny \right\},$$ where $\bar{b} = Q_m^Tb$. So far so good, I think. Just as an observation: the last minimization problem looks exactly like the conjugate gradient approach to the problem $H^THy=H^T\bar{b}$.
Now as I understand it, the GMRES method is all about finding $H_n$ using the Arnoldi algorithm. Then we can solve the minimization problem by differentiating w.r.t. $y$ and setting the result to zero: $$-2H_n^T\bar{b} + 2H_n^TH_n y = 0,$$ giving $y = (H_n^TH_n)^{-1}H_n^TQ_n^Tb$, so that our approximation is $x_m = Q_n (H_n^TH_n)^{-1}H_n^TQ_n^Tb$. Now we insert some factors $Q_n^TQ_n$ to get $$x_m = Q_n (H_n^TH_n)^{-1}Q_n^TQ_nH_n^TQ_n^Tb =(Q_nH_n^TH_nQ_n^T)^{-1}Q_nH_n^TQ_n^Tb = (Q_nH_n^TQ_n^TQ_nH_nQ_n^T)^{-1}Q_nH_n^TQ_n^Tb = (A^TA)^{-1}A^Tb = A^{-1}b.$$ Now this is going wrong somewhere. I guess in the case that $H$ and $Q$ are $n\times n$ the result is right, in that the approximation is the exact answer. Normally in projections the matrices to be inverted are not square, but in this case the Hessenberg matrices are square. What am I doing wrong here? Also, is the rest of the derivation, say up to the approximation $x_m = Q_n (H_n^TH_n)^{-1}H_n^TQ_n^Tb$, correct? Thanks for the help!
The GMRES method indeed computes the exact solution at some iteration $m\leq n$. For this reason, the "optimal" Krylov subspace methods such as CG or GMRES can be considered direct methods. In contrast to direct methods based on Gaussian elimination, however, they can be stopped prematurely once a prescribed tolerance is achieved.
In order to make things clear, let me "rederive" the GMRES method here. Starting from an initial guess $x_0$ with the associated residual vector $r_0:=b-Ax_0$, the GMRES method seeks the approximate solution $x_k$ at step $k$ (to avoid confusion of $m$ with $n$) in the affine subspace $$x_0+\mathcal{K}_k:=x_0+\mathrm{span}(r_0,Ar_0,\ldots,A^{k-1}r_0).$$
The first step is to obtain a basis $Q_k=[q_1,\ldots,q_k]$ of $\mathcal{K}_k$. One could of course use the vectors $[r_0,Ar_0,\ldots,A^{k-1}r_0]$ directly. However, this approach is strongly discouraged, since the vectors $r_0,Ar_0,\ldots,A^{k-1}r_0$ become nearly linearly dependent (note that they converge in direction to a dominant eigenvector of $A$), and in fact the condition number of this basis grows exponentially with $k$. For numerical reasons, one should use the best conditioned basis possible, which is an orthonormal one. This is achieved by the Arnoldi method with $r_0$ as the starting vector, which can be considered a variant of the Gram-Schmidt orthogonalization method. At step $k$, we have $$\tag{1} AQ_k=Q_{k+1}H_{k+1,k}, \quad Q_ke_1=r_0/\rho, \quad \rho:=\|r_0\|_2, $$ where $Q_k$ has orthonormal columns ($Q_k^TQ_k=I_k$) and $H_{k+1,k}$ is $(k+1)\times k$ upper Hessenberg. Indeed, the Arnoldi method can be considered as the Gram-Schmidt process applied to $[r_0,AQ_k]$: $$ [r_0,AQ_k]=[\rho\,Q_{k+1}e_1,Q_{k+1}H_{k+1,k}]=Q_{k+1}[\rho\,e_1,H_{k+1,k}]. $$ The right-hand side has the form of a QR factorization, since $Q_{k+1}$ has orthonormal columns and $[\rho\,e_1,H_{k+1,k}]$ is upper triangular.
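As an illustration, here is a minimal NumPy sketch of the Arnoldi process (the function name `arnoldi` and the test setup are my own, not from any particular library); it builds $Q_{k+1}$ and $H_{k+1,k}$ and checks relation (1) and the orthonormality of the basis:

```python
import numpy as np

def arnoldi(A, r0, k):
    """Build an orthonormal basis Q (n x (k+1)) of the Krylov space
    span(r0, A r0, ..., A^k r0) and the (k+1) x k upper Hessenberg H
    with A Q[:, :k] = Q @ H.  Sketch only: no breakdown handling."""
    n = len(r0)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt step
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
r0 = rng.standard_normal(8)
k = 4
Q, H = arnoldi(A, r0, k)
print(np.allclose(A @ Q[:, :k], Q @ H))     # relation (1): True
print(np.allclose(Q.T @ Q, np.eye(k + 1)))  # orthonormal columns: True
```

Note that the code uses the modified Gram-Schmidt variant, which is the usual choice in practice for its better numerical orthogonality.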
Now in GMRES, we seek $x_k\in x_0+\mathcal{K}_k=x_0+\mathcal{R}(Q_k)$, that is, $x_k=x_0+Q_ky_k$ for some $k$-vector $y_k$, such that $x_k$ minimizes the 2-norm of the residual $b-A\tilde{x}$ over all $\tilde{x}\in x_0+\mathcal{K}_k$: $$ \|b-Ax_k\|_2=\min_{\tilde{x}\in x_0+\mathcal{K}_k}\|b-A\tilde{x}\|_2 =\min_{\tilde{y}}\|r_0-AQ_k\tilde{y}\|_2. $$ We can use (1) to get $r_0-AQ_k\tilde{y}=Q_{k+1}(\rho\,e_1-H_{k+1,k}\tilde{y})$ and the fact that the 2-norm is invariant under multiplication by a matrix with orthonormal columns to get $$\tag{2} \|b-Ax_k\|_2=\min_{\tilde{y}}\|r_0-AQ_k\tilde{y}\|_2=\min_{\tilde{y}}\|\rho\,e_1-H_{k+1,k}\tilde{y}\|_2. $$ Solving the reduced problem (2) gives the coefficients $y_k$ of the GMRES correction. Since (2) is nothing but a least squares problem, we can formally solve it using the pseudo-inverse of $H_{k+1,k}$: $$\tag{3} y_k=\rho\,H_{k+1,k}^\dagger e_1=\rho\,(H_{k+1,k}^TH_{k+1,k})^{-1}H_{k+1,k}^Te_1, $$ where the latter expression is valid because $H_{k+1,k}$ has full column rank $k$: if $H_{k+1,k}\tilde{y}=0$, then $AQ_k\tilde{y}=Q_{k+1}H_{k+1,k}\tilde{y}=0$, and the nonsingularity of $A$ together with the full column rank of $Q_k$ forces $\tilde{y}=0$. Therefore, the approximation $x_k$ is given by $$\tag{4} x_k=x_0+Q_ky_k=x_0+\rho\,Q_kH_{k+1,k}^\dagger e_1. $$ Note that (3) is not how the vector $y_k$ is obtained in practice. Instead, one solves the least squares problem (2) via the QR factorization of $H_{k+1,k}$ (usually computed with Givens rotations, thanks to the Hessenberg structure).
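To make this concrete, here is a self-contained numerical sketch of one GMRES "snapshot" at a fixed step $k$ (the helper names are mine, and `numpy.linalg.lstsq` stands in for the Givens-rotation QR used in production codes):

```python
import numpy as np

def arnoldi(A, r0, k):
    # Arnoldi relation (1): A Q[:, :k] = Q @ H, Q orthonormal, H (k+1) x k Hessenberg
    n = len(r0)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):           # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

def gmres_step(A, b, x0, k):
    # solve the reduced least squares problem (2) and form x_k = x0 + Q_k y_k, as in (4)
    r0 = b - A @ x0
    rho = np.linalg.norm(r0)
    Q, H = arnoldi(A, r0, k)
    rhs = np.zeros(k + 1)
    rhs[0] = rho                         # rho * e_1
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return x0 + Q[:, :k] @ y

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10)) + 5 * np.eye(10)   # shifted to be safely nonsingular
b = rng.standard_normal(10)
x3 = gmres_step(A, b, np.zeros(10), 3)
x6 = gmres_step(A, b, np.zeros(10), 6)
# residual norms are nonincreasing in k, since the Krylov spaces are nested
print(np.linalg.norm(b - A @ x3) >= np.linalg.norm(b - A @ x6))  # True
```

The monotonicity of the residual norm in $k$ is exactly the minimization property (2) applied over the nested subspaces $\mathcal{K}_3\subseteq\mathcal{K}_6$.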
At some point, the Krylov subspace is "exhausted", in the sense that there is an $m$ such that $\dim\mathcal{K}_k=m$ for all $k\geq m$. This $m$ is equal to the smallest $k$ for which $r_0\in A\mathcal{K}_k$ and is bounded from above by the degree of the minimal polynomial of $A$ (therefore, $m\leq n$). For $k=m$, the subdiagonal entry $h_{m+1,m}$ vanishes and (1) translates to $$\tag{5} AQ_m=Q_mH_m, $$ where $H_m$ is the $m\times m$ nonsingular upper Hessenberg matrix obtained from $H_{m+1,m}$ by dropping its (zero) last row.
At this step $k=m$, GMRES computes the exact solution of $Ax=b$. This can be seen as follows: since $r_0\in A\mathcal{K}_m$, we have $A^{-1}r_0=x-x_0\in A^{-1}A\mathcal{K}_m=\mathcal{K}_m$ and hence $x\in x_0+\mathcal{K}_m$. Therefore, the affine space $x_0+\mathcal{K}_m$ contains the solution $x$, and since GMRES minimizes the residual norm, it must give $x_m=x$. Alternatively, (4) and (5) give $x_m=x_0+\rho\,Q_mH_m^{-1}e_1$ and hence $$ Ax_m=Ax_0+\rho\,AQ_mH_m^{-1}e_1=Ax_0+\rho\,Q_mH_mH_m^{-1}e_1=Ax_0+\rho\,Q_me_1=Ax_0+r_0=b. $$
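This finite-termination behaviour is easy to observe numerically. Below is a small sketch of my own (a minimal GMRES with a "lucky breakdown" test, not a production implementation): for a diagonalizable matrix with only three distinct eigenvalues, the minimal polynomial has degree 3, so the Arnoldi process breaks down at some $m\leq 3$ and the computed $x_m$ solves $Ax=b$ up to roundoff.

```python
import numpy as np

def gmres_until_breakdown(A, b, x0, tol=1e-8):
    # run Arnoldi until breakdown (Krylov space exhausted at m = k),
    # then solve the reduced least squares problem (2); returns x_m and m
    n = len(b)
    r0 = b - A @ x0
    rho = np.linalg.norm(r0)
    Q = np.zeros((n, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = r0 / rho
    k = n
    for j in range(n):
        w = A @ Q[:, j]
        for i in range(j + 1):                  # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < tol * rho:             # exhaustion: h_{k+1,k} ~ 0
            k = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    rhs = np.zeros(k + 1)
    rhs[0] = rho
    y, *_ = np.linalg.lstsq(H[:k + 1, :k], rhs, rcond=None)
    return x0 + Q[:, :k] @ y, k

rng = np.random.default_rng(2)
n = 12
S = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned similarity
D = np.diag(np.repeat([1.0, 2.0, 3.0], n // 3))     # only 3 distinct eigenvalues
A = S @ D @ np.linalg.inv(S)                        # minimal polynomial has degree 3
b = rng.standard_normal(n)
x, m = gmres_until_breakdown(A, b, np.zeros(n))
print(m <= 3)                                       # True
print(np.linalg.norm(b - A @ x) < 1e-8)             # True: exact up to roundoff
```

The breakdown tolerance `tol` is a heuristic of mine; in exact arithmetic $h_{m+1,m}$ is exactly zero at exhaustion, while in floating point it is merely tiny.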