Least squares: How does the projection theorem lead to the normal equations?


I understand why $\vec{P}$ (the orthogonal projection of $\vec{b}$ onto $W$) is the vector in $W$ closest to $\vec{b}$ (the vector for which the equation has no exact solution, which is why we need this process).

But I do not understand the equation:
$$A^TA\vec{{x}}=A^T \vec{{b}}$$ Why does multiplying each side of the original equation $A\vec{{x}}=\vec{{b}}$ (We know she has no answer) by $A^T$ give us the closest answer?
Thank you

3 Answers

BEST ANSWER

Given vector $\vec b$ and matrix $A$, the goal is to find $\vec x$ so that $A\vec x$ is as close to $\vec b$ as possible.

Let $W$ be the subspace of all vectors of the form $A\vec x$. Observe that $W$ is the set of all linear combinations of columns of $A$.

We are looking for the vector in $W$ that is closest to $\vec b$. Write $A\hat x$ for this vector. By the projection theorem, $A\hat x$ is the orthogonal projection of $\vec b$ onto $W$. Equivalently, $A\hat x-\vec b$ is orthogonal to every vector in $W$. As a special case, we conclude:

$$ \text{$A\hat x-\vec b$ is orthogonal to every column of matrix $A$}. \tag{*}$$

Constraint (*) can be written in matrix form as $$ A^T(A\hat x-\vec b) = \vec 0,$$ which leads to the equation $A^TA\hat x=A^T\vec b$.
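For concreteness, here is a minimal numerical sketch of this derivation (using NumPy; the matrix $A$ and vector $\vec b$ are made-up examples, not from the question). Solving the normal equations and then checking $A^T(A\hat x - \vec b)$ confirms the orthogonality constraint (*):

```python
import numpy as np

# Made-up overdetermined system: 4 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.0, 1.0, 1.0, 3.0])

# Solve the normal equations  A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# By (*), the residual A x_hat - b is orthogonal to every column of A,
# so A^T (A x_hat - b) should be zero up to floating-point error.
print(A.T @ (A @ x_hat - b))   # ~ [0., 0.]
```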

ANSWER

I'm not able to comment, but my answer to "How to formulate ordinary least squares regression in component formalism?" may address your question. The reason multiplying each side by $A^T$ helps is that while $A$ may not have an inverse, $A^TA$ typically does (it is invertible exactly when the columns of $A$ are linearly independent), allowing you to solve for $\vec{x}$.
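A small sketch of that invertibility point, assuming NumPy and reusing a made-up tall matrix: $A$ itself is $4\times 2$ and has no inverse, but $A^TA$ is $2\times 2$ and invertible here, so the textbook formula $\hat x = (A^TA)^{-1}A^T\vec b$ can be evaluated directly.

```python
import numpy as np

# A tall matrix has no inverse, but A^T A is square and is invertible
# whenever the columns of A are linearly independent.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.0, 1.0, 1.0, 3.0])

x_hat = np.linalg.inv(A.T @ A) @ (A.T @ b)   # textbook formula

# np.linalg.lstsq solves the same problem with better numerical stability.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_hat, x_ref))             # True
```

In practice one would call `np.linalg.lstsq` (or a QR factorization) rather than forming $(A^TA)^{-1}$ explicitly; the explicit inverse is shown only to mirror the algebra.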

ANSWER

@nosuchthingasmagic links to a question with a calculus-based derivation. Here's another one.

The standard notation is $\vec y = X \vec{\beta}$. Then we note that this has no exact solution, so really $X \vec{\beta} = \vec y + \vec{\epsilon}$. If $\vec \beta$ is optimal, then $X^T \vec{\epsilon}=\vec 0$. So when we multiply both sides by $X^T$, $\vec{\epsilon}$ disappears and we're left with $X^TX \vec{\beta} = X^T\vec y$.

Why $X^T \vec {\epsilon}=\vec 0$? Well, that's saying that every row of $X^T$ is orthogonal to $\vec {\epsilon}$. And the rows of $X^T$ represent all the observations of a particular feature. So this is saying that for each feature, the observations of that feature are uncorrelated with the errors. If there were a correlation, we could adjust the corresponding $\beta_i$ to get a better fit.
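A sketch in the same spirit (NumPy, made-up random data and hypothetical variable names): at the least-squares optimum, $X^T\vec\epsilon$ comes out numerically zero, i.e., each feature's column of observations is orthogonal to the errors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # 50 observations of 3 features (made-up data)
y = rng.normal(size=50)          # made-up response

# Least-squares fit: beta solves X^T X beta = X^T y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# With the convention X beta = y + eps, the residual is eps = X beta - y.
eps = X @ beta - y
print(X.T @ eps)                 # ~ [0., 0., 0.]: each feature orthogonal to the errors
```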