I know this is is a well known fact linear algebra (and feel should be super obvious) but I got a little stuck.
I was to prove that choosing an $x \in \mathbb{R}^K$ such that $A^T(b - Ax) = 0$ is optimal when minimizing $\| b - Ax \|^2$. The intuition is obvious, we want the error vector $b - Ax$ to perpendicular to the column space of A i.e. $$(b - Ax) \perp C(A)$$ but for some reason, the mathematics didn't come out right.
This is what I have:
$$ \min_{x \in \mathbb{R}^K} \|Ax - b \|^2$$
we know:
$\|Ax - b \|^2 = (Ax -b)^T(Ax - b) = \| Ax \|^2 + \| b \|^2 - 2(Ax)^Tb = x^T A^T Ax - 2x^T A^T b + b^Tb$
So it suffices to minimize $x^T A^T Ax - 2x^T A^T b = x^T (A^T A x - 2A^T b)$ since $b^Tb$ is independent of our choice of $x$. Ideally we would want this to be zero:
$$x^T (A^T A x - 2A^T b) = 0$$
which seems to be possible since the minimization doesn't say that $x = 0$ is illegal. Which obviously $x = 0$ has to be wrong (for some reason) and doesn't match the initial intuition that I have.
Even if we didn't choose such an $x$ (which is wrong for some reason) we could also say that $x^T (A^T A x - 2A^T b) = 0$ means that either $x = 0$ or $A^T A x - 2A^T b = 0$. If we choose the second one we get something very close to the desired result:
$A^T A x - 2A^T b = A^T (A x - 2b) = 0$
which is nearly exactly what I wanted (we wanted $A^T(b - Ax) = 0$) but it seems its off by a factor of 2.
I thought the mistake of factor of 2 would have come from saying $ - (Ax)^T b - b^T (Ax) = -(Ax)^T b $ but I think that statement is correct. Or did I make some other mistake somewhere else?
So my issues are:
- Why is my reasoning for choosing $x = 0$ is wrong?
- Why is my equation $A^T (A x - 2b) = 0$ off by a factor of 2 with $b$.
- Even after those issue are resolved, how does one show optimality when $A^T(Ax - b ) = 0$
We can first prove a general lemma:
Lemma: Let $b \in \mathbb R^m$ and let $V$ be a subspace of $\mathbb R^m$. Suppose $\hat b \in V$ and $\langle b - \hat b, v \rangle = 0$ for all $v \in V$. Then $\| b - \hat b \|_2 \leq \| b - v \|_2$ for all $ v\in V$.
Proof: Let $v \in V$. Then $\|b - v \|_2^2 = \| b - \hat b + \hat b - v \|_2^2 = \|b - \hat b \|_2^2 + \|\hat b - v \|_2^2 \geq \| b - \hat b \|_2^2.$ (The second equality follows from the Pythagorean theorem, which is applicable because $b - \hat b \perp \hat b - v$.)
Having established this lemma, the intuition you gave is now a rigorous proof. If $A^T(b - Ax^\star) = 0$, then the vector $b - A x^\star$ is orthogonal to every vector in the range of $A$. It follows that $\| b - A x^\star \|_2 \leq \| b - Ax \|_2$ for all $x$.