I understand why $\vec{P}$ (the orthogonal projection of $\vec{b}$ onto $W$)
is the closest vector to $\vec{b}$ (the vector for which the equation has no solution,
which is why we need this process in the first place).
But I do not understand the equation:
$$A^TA\vec{x}=A^T \vec{b}$$
Why does multiplying each side of the original equation $A\vec{x}=\vec{b}$ (which we know has no solution)
by $A^T$ give us the closest answer?
Thank you
Given vector $\vec b$ and matrix $A$, the goal is to find $\vec x$ so that $A\vec x$ is as close to $\vec b$ as possible.
Let $W$ be the subspace of all vectors of the form $A\vec x$. Observe that $W$ is the set of all linear combinations of columns of $A$.
We are looking for the vector in $W$ that is closest to $\vec b$. Write $A\hat x$ for this vector. By the projection theorem, $A\hat x$ is the orthogonal projection of $\vec b$ onto $W$. Equivalently, $A\hat x-\vec b$ is orthogonal to every vector in $W$. As a special case, we conclude:
$$ \text{$A\hat x-\vec b$ is orthogonal to every column of matrix $A$}. \tag{*}$$
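Constraint (*) can be written in matrix form as: $$ A^T(A\hat x-\vec b) = \vec 0,$$ because the rows of $A^T$ are the columns of $A$, so each entry of $A^T(A\hat x-\vec b)$ is the dot product of one column of $A$ with $A\hat x-\vec b$. Expanding the product leads to the equation $A^TA\hat x=A^T\vec b$.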
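
If it helps to see this numerically, here is a minimal NumPy sketch. The matrix `A` and vector `b` are made-up example data (not from the question), chosen so that $A\vec x=\vec b$ has no exact solution, and the sketch assumes the columns of $A$ are linearly independent so that $A^TA$ is invertible. It solves $A^TA\hat x=A^T\vec b$ and then checks constraint (*) directly.

```python
import numpy as np

# Hypothetical example data: 3 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the normal equations A^T A x = A^T b.
# Assumes the columns of A are linearly independent, so A^T A is invertible.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual A x_hat - b should be orthogonal to every column of A,
# which is exactly the statement A^T (A x_hat - b) = 0.
residual = A @ x_hat - b
orthogonality_check = A.T @ residual

# Cross-check against NumPy's built-in least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print("x_hat               =", x_hat)
print("A^T (A x_hat - b)   =", orthogonality_check)
print("agrees with lstsq?  =", np.allclose(x_hat, x_lstsq))
```

Any other choice of $\vec x$ gives a vector $A\vec x$ that is farther from $\vec b$, which you can check by perturbing `x_hat` and comparing `np.linalg.norm(A @ x - b)`.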