Does the exact solution for $A^TAx = A^Ty$ minimize the least squares error for $Ax = y$?


The Eigen library’s documentation on how to solve linear least squares shows three examples: https://eigen.tuxfamily.org/dox/group__LeastSquares.html

Using an SVD, using a QR decomposition, and using the normal equations.

If I have an overconstrained problem $Ax = y$ (more data points in $y$ than variables in $x$), then I can transform it into a problem that isn't overconstrained by multiplying both sides on the left by the transpose of $A$, giving $A^TAx = A^Ty$. This new problem has as many equations as variables to fit, so I can compute an exact solution for $x$.
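To make the construction concrete, here is a minimal pure-Python sketch (no linear-algebra library assumed) that forms the normal equations for a small made-up overdetermined system and solves the resulting $2\times 2$ system by Cramer's rule. The matrix `A` and vector `y` are illustration data, not from the original post.

```python
A = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]          # 3 data points, 2 unknowns (intercept, slope)
y = [1.0, 2.0, 2.0]

# A^T A is 2x2 and A^T y has 2 entries: the same size as the unknown x.
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
Aty = [sum(A[k][i] * y[k] for k in range(3)) for i in range(2)]

# Solve the 2x2 system AtA x = Aty by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
x = [(Aty[0] * AtA[1][1] - AtA[0][1] * Aty[1]) / det,
     (AtA[0][0] * Aty[1] - Aty[0] * AtA[1][0]) / det]
print(x)
```

For these three points the solution comes out as the familiar least-squares line with slope $1/2$ and intercept $2/3$.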

Is this solution $x$ really also the one that minimizes the squared error of the original problem? How is this possible? Both the matrix $A^TA$ and the vector $A^Ty$ are smaller than those of the original problem and thus seem to contain less information.

Or is the Eigen documentation misleading and this doesn’t actually yield an optimal solution in the least squares sense (numerical stability aside)?


There are 2 best solutions below


Is this solution x really also the one that minimizes the squared error of the original problem?

Yes, as I'll show below.

Let $f(x) = (1/2) \| Ax - y \|^2$. We can minimize $f$ by setting the gradient of $f$ equal to $0$. But, as has been shown in many threads on least squares, the gradient of $f$ is $\nabla f(x) = A^T(Ax-y)$. So we obtain $$ \nabla f(x) = A^T(Ax-y) = 0 \implies A^TAx = A^Ty. $$ This linear system of equations is called the "normal equations".
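As a sanity check, the optimality condition $A^T(Ax - y) = 0$ can be verified numerically: at the least-squares solution, the residual is orthogonal to every column of $A$. The data below is a made-up $3\times 2$ example whose normal-equation solution is $x = [2/3,\ 1/2]$.

```python
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]
x = [2.0 / 3.0, 0.5]      # solves A^T A x = A^T y for this data

# Residual r = A x - y.
r = [sum(A[k][j] * x[j] for j in range(2)) - y[k] for k in range(3)]

# Gradient A^T r should vanish (up to rounding).
grad = [sum(A[k][i] * r[k] for k in range(3)) for i in range(2)]
print(grad)
```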


By the way, it's easy to compute $\nabla f(x)$ using the multivariable chain rule. Notice that $f(x) = g(h(x))$, where $$ h(x) = Ax - y \quad \text{and} \quad g(u) = \frac12 \|u\|^2. $$ The derivatives of $h$ and $g$ are $$ h'(x) = A, \qquad g'(u) = u^T. $$ By the chain rule, $$ f'(x) = g'(h(x)) h'(x) = (Ax - y)^T A. $$ If we use the convention that $\nabla f(x)$ is a column vector, then $$ \nabla f(x) = f'(x)^T = A^T (Ax - y). $$
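The chain-rule result can also be checked with finite differences: at an arbitrary (non-optimal) point $x$, the analytic gradient $A^T(Ax - y)$ should match the numeric gradient of $f(x) = \frac12\|Ax - y\|^2$. All numbers below are made up for illustration.

```python
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]

def f(x):
    # f(x) = (1/2) ||A x - y||^2
    return 0.5 * sum((sum(A[k][j] * x[j] for j in range(2)) - y[k]) ** 2
                     for k in range(3))

x = [0.3, -0.2]           # arbitrary point, not the minimizer

# Analytic gradient A^T (A x - y).
r = [sum(A[k][j] * x[j] for j in range(2)) - y[k] for k in range(3)]
analytic = [sum(A[k][i] * r[k] for k in range(3)) for i in range(2)]

# Central finite differences.
h = 1e-6
numeric = []
for i in range(2):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    numeric.append((f(xp) - f(xm)) / (2 * h))

print(analytic, numeric)
```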


Let $x$ be the candidate solution from your method, let $t$ be a scalar variable, and consider a slightly perturbed solution $x + t w$, where $w$ is an arbitrary vector. The quantity to minimize is the scalar $$ ( A(x+tw) - y)^T ( A(x+tw) - y) $$ $$ = x^TA^T Ax + 2t\, x^T A^T A w + t^2 w^T A^T A w - 2 y^T Ax - 2t\, y^T Aw + y^T y. $$ This uses the fact that, given symmetric $C$, the scalar $u^T C v = v^T C u$ is its own transpose. Collecting powers of $t$ gives $$ x^TA^T Ax - 2 y^T Ax + y^T y + t \left(2 x^T A^T A w - 2 y^T Aw \right) + t^2 w^T A^T A w. $$ The $t^2$ coefficient is $w^T A^T A w = \|Aw\|^2 \geq 0$, so our $x$ achieves a minimum exactly when the coefficient of the linear term in $t$ is $0$ regardless of the choice of $w$; otherwise a small step $t$ of the appropriate sign would decrease the value. That is, we need $$ \left(2 x^T A^T A - 2y^T A \right) w = 0 $$ for every $w$, which forces $$ 2 x^T A^T A - 2y^T A = 0. $$ Transpose for $$ A^T A x = A^T y. $$
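The perturbation argument can be illustrated numerically: for a normal-equation solution $x$, the squared error at $x + tw$ is never smaller than at $x$ for any direction $w$ and step $t$, since the linear term in $t$ vanishes and the quadratic term is $\|Aw\|^2 \geq 0$. The data is a made-up $3\times 2$ example with solution $x = [2/3,\ 1/2]$.

```python
A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 2.0]
x = [2.0 / 3.0, 0.5]      # solves A^T A x = A^T y for this data

def sq_err(v):
    # ||A v - y||^2
    return sum((sum(A[k][j] * v[j] for j in range(2)) - y[k]) ** 2
               for k in range(3))

base = sq_err(x)
# Try several directions w and step sizes t; none should beat the base error.
ok = all(sq_err([x[0] + t * w[0], x[1] + t * w[1]]) >= base - 1e-12
         for w in [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (-2.0, 3.0)]
         for t in [-1.0, -0.1, 0.1, 1.0])
print(ok)
```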