I completely understand how the projection matrix formula $$P = A(A^TA)^{-1}A^T$$ is derived from $$ A^T(b - A\hat{x}) = 0, $$ but what I don't understand is the "story proof" or the "intuition" behind the first formula as a linear transformation onto the column space of $A$, which is what it is supposed to be.
In fact I have three specific questions:
- why would someone transform $b$ (the vector to be projected onto the column space of $A$) into the "row space" of $A$ before it can be transformed into the column space of $A$?
- what specific transformation does the matrix $A^TA$ encode?
- why should one apply the inverse of the transformation $A^TA$ to the vector before it can be transformed into the column space of $A$?
Here is a beautiful one:
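The strategy for finding the projection vector $A\hat{x}$ in the column space of $A$ is to find a vector $p$ that has the same dot products with the columns of $A$ as $b$ does. This is just the condition $A^T(b - p) = 0$ read column by column: the error $b - p$ is orthogonal to every column of $A$, so each column has the same dot product with $p$ as with $b$.
So, first one should find the dot product of $b$ with each column of $A$ by computing the product $A^Tb$.
Then one wants to find the linear combination of the columns of $A$ that gives the same dot products.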
First one must find the coefficients of this linear combination. The entries of the matrix $A^TA$ are the dot products of each column of $A$ with every column of $A$, including itself. So the matrix $A^TA$ translates the coefficients of a linear combination of the columns of $A$ into the dot products of that combination with each column of $A$. Thus $(A^TA)^{-1}$ does the reverse (it exists as long as the columns of $A$ are linearly independent): it takes the dot products with each column and spits out the necessary coefficient of each column in the linear combination. That's exactly what we want.
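To make this concrete, here is a small worked case. It assumes, purely for illustration, that $A$ has just two columns $a_1, a_2$ and that $c = (c_1, c_2)$ holds the coefficients (these names are not part of the original argument):

$$A^TA\,c = \begin{pmatrix} a_1\cdot a_1 & a_1\cdot a_2 \\ a_2\cdot a_1 & a_2\cdot a_2 \end{pmatrix}\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} a_1\cdot(c_1a_1 + c_2a_2) \\ a_2\cdot(c_1a_1 + c_2a_2) \end{pmatrix} = A^T(Ac)$$

The input is the pair of coefficients, and the output is the pair of dot products of the combination $Ac$ with $a_1$ and with $a_2$.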
Remember that by computing the product $A^Tb$ we found the dot product of $b$ with each column of $A$. Now we want to know which linear combination gives the same dot products. So we multiply $A^Tb$ by $(A^TA)^{-1}$.
Now we have the coefficients of the linear combination that has the same dot products with the columns of $A$ as $b$ does. We simply multiply $(A^TA)^{-1}A^Tb$, which was derived in the previous paragraph, by $A$: since we have the coefficients, we multiply each coefficient by its corresponding column and add them up. That's exactly what $A(A^TA)^{-1}A^Tb$ does.
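The three steps can also be checked numerically. The following is a minimal NumPy sketch, not a definitive implementation; the matrix `A`, the vector `b`, and the variable names `dots`, `coeffs`, `p` are made-up examples chosen for illustration, and it assumes the columns of `A` are linearly independent so that $A^TA$ is invertible.

```python
import numpy as np

# Hypothetical example data: two independent columns in R^3 and a vector b.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Step 1: dot products of b with each column of A, i.e. A^T b.
dots = A.T @ b                            # [6, 0]

# Step 2: coefficients of the combination of columns that has those
# same dot products. Solving (A^T A) c = A^T b is equivalent to
# multiplying by (A^T A)^{-1}, without forming the inverse explicitly.
coeffs = np.linalg.solve(A.T @ A, dots)   # [5, -3]

# Step 3: multiply each coefficient by its column and add them up.
p = A @ coeffs                            # [5, 2, -1]

# The same result via the explicit projection matrix P = A (A^T A)^{-1} A^T.
P = A @ np.linalg.inv(A.T @ A) @ A.T

print("dot products with b:", dots)
print("dot products with p:", A.T @ p)    # matches the line above
print("projection p:", p)
print("P @ b agrees with p:", np.allclose(P @ b, p))
```

Here `A.T @ p` and `A.T @ b` agree, which is the "same dot products" condition, and the error `b - p` is orthogonal to both columns of `A`.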