I'm trying to understand least squares through linear algebra and projection, specifically the normal equations $A^{T}A\hat{x} = A^{T}b$. I can work through several lines of reasoning, including the calculus derivation and simply solving $Ax + e = b$ subject to $A^{T}(b - Ax) = 0$. The logic behind both of these is clear. What I am trying to understand is why simply multiplying both sides of the equation $Ax = b$ by $A^{T}$ gives the right answer. I came across this question/answer, which really helped, but still left me somewhat confused. My understanding so far amounts to this:
The vector $b$ can be broken down into two components, one that is in the column space of $A$ and another that is in the left null space of $A$, $N(A^{T})$. We are interested in solving $A\hat{x} = b^{C(A)}$, where $b^{C(A)}$ is the projection of $b$ onto the column space of $A$. In fact, if we construct a matrix $[N \space A]$, where $N$ is a basis for $N(A^{T})$, and solve $[N \space A]x = b$, then the components of $x$ that multiply the $A$ block of the matrix are the least squares estimates.
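To make this concrete, here is a small NumPy sanity check. The sizes $m=5$, $n=2$ and the random data are just illustrative assumptions, and I take the last $m-n$ left singular vectors of $A$ as the basis $N$ for $N(A^{T})$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
A = rng.standard_normal((m, n))   # full column rank almost surely
b = rng.standard_normal(m)

# Least squares solution via the normal equations.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection of b onto C(A); the residual lies in the left null space.
b_C = A @ x_hat
assert np.allclose(A.T @ (b - b_C), 0)

# Basis N for N(A^T): the last m-n left singular vectors of A.
U, s, Vt = np.linalg.svd(A)
N = U[:, n:]                      # m x (m-n), columns span N(A^T)

# Solve the square system [N A] x = b; the last n entries multiply
# the A block and recover the least squares estimates.
x_full = np.linalg.solve(np.hstack([N, A]), b)
assert np.allclose(x_full[-n:], x_hat)
```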
When we multiply $b$ by $A^{T}$, the component of $b$ in the left null space is sent to zero; under the linear transformation $A^{T}$, both $b$ and $b^{C(A)}$ are therefore mapped to the same vector. (I am assuming here that $A$ is $m \times n$ with $m > n$.)
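A quick numerical illustration of this collapsing, again with assumed random data, using the projector $P = A(A^{T}A)^{-1}A^{T}$ to form $b^{C(A)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Orthogonal projector onto C(A) (valid when A has full column rank).
P = A @ np.linalg.solve(A.T @ A, A.T)
b_C = P @ b          # component of b in the column space
b_N = b - b_C        # component of b in the left null space N(A^T)

# A^T annihilates b_N, so b and b_C have the same image under A^T.
assert np.allclose(A.T @ b_N, 0)
assert np.allclose(A.T @ b, A.T @ b_C)
```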
Because linear transformations satisfy $L(au + cv) = aL(u) + cL(v)$, when we solve for $\hat{x}$ after applying $A^{T}$, the same vector $\hat{x}$ should also solve the system $A\hat{x} = b^{C(A)}$.
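As a check of this equivalence (illustrative random data; $A$ is assumed to have full column rank so both systems have a unique solution):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Solve the normal equations A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
b_C = A @ x_normal   # projection of b onto C(A)

# Solving A x = b_C directly (a consistent system) gives the same x_hat.
x_direct, *_ = np.linalg.lstsq(A, b_C, rcond=None)
assert np.allclose(x_direct, x_normal)
```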
The claim that $A^{T}$ maps $b$ and $b^{C(A)}$ to the same vector is where my understanding is a little shaky. I don't fully understand how these transformations that aren't one-to-one work. My best guess is that if we could somehow undo the $A^{T}$ transformation, the zero vector would map back to the zero vector and $A^{T}b$ would map back to $b^{C(A)}$. That would explain why solving $A^{T}A\hat{x} = A^{T}b$ is equivalent to solving $A\hat{x} = b^{C(A)}$, which turns out to be the correct method. This is really the crux of my question: there are infinitely many vectors that $A^{T}$ maps to the same image, so I don't understand why solving $A^{T}A\hat{x} = A^{T}b$ finds the right $\hat{x}$ for the vector we're looking for in the higher-dimensional space before the transformation is applied.
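Here is the phenomenon numerically: infinitely many vectors share the image $A^{T}b$ (we can add anything from $N(A^{T})$), yet only one of them lies in $C(A)$. This is a sketch with assumed random data, relying on $A$ having full column rank so that $A^{T}A$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)
b_C = A @ x_hat      # the preimage of A^T b that lies in C(A)

# Many vectors share the image A^T b: add anything from N(A^T).
U, _, _ = np.linalg.svd(A)
z = U[:, n:] @ rng.standard_normal(m - n)   # arbitrary vector in N(A^T)
assert np.allclose(A.T @ (b_C + z), A.T @ b)

# But restricted to C(A), A^T is one-to-one: A^T A is invertible for
# full-column-rank A, so b_C is the unique preimage in the column space.
assert np.linalg.matrix_rank(A.T @ A) == n
```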
Thanks