Why does $A^TAw = A^Tb$ describe the least squares solution to a multivariate system?


Note that this builds off a previous question, found here.

Suppose I have a system of equations of the form:

$$Wv_1 + Xv_2 + Yv_3 + Zv_4 = b$$

where $v_1$, $v_2$, $v_3$, $v_4$, and $b$ are vectors. I want to determine the least squares solution for the unknowns $W,X,Y,Z$. As I understand it, the solution is the vector $w$ in:

$$A^{T}Aw = A^{T}b$$

where $A$ is the matrix whose columns are $v_1$, $v_2$, $v_3$, $v_4$.

It isn't clear to me, however, why this is. Why is $w$ the least-squares solution, mathematically and intuitively? I've tried to do some research online, but I haven't found much (and what I have found relies heavily on terminology with which I'm unfamiliar; unfortunately, this isn't a subject I know well).
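For concreteness, here is how the setup looks numerically. This is a sketch with made-up data: four random vectors $v_1,\dots,v_4$ in $\mathbb{R}^6$ and a random target $b$; `np.linalg.lstsq` solves the same least squares problem directly, so it serves as a cross-check on the normal equations.

```python
import numpy as np

# Made-up example data: four vectors v1..v4 in R^6 and a target b.
rng = np.random.default_rng(0)
v1, v2, v3, v4 = (rng.standard_normal(6) for _ in range(4))
b = rng.standard_normal(6)

# A has the vectors as columns; w = (W, X, Y, Z) holds the unknown scalars.
A = np.column_stack([v1, v2, v3, v4])

# Solve the normal equations A^T A w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq minimizes ||b - Aw|| directly; both give the same w.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(w, w_lstsq)
```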

2 Answers

BEST ANSWER

Consider $A\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^{m}$. The least squares problem, defined as seeking $x\in\mathbb{R}^{n}$ such that $\|b-Ax\|_{2}=\min_{y \in \mathbb{R}^{n}}\|b-Ay\|_{2}$, is equivalent to minimizing the function $\Phi(x)=\|b-Ax\|_{2}^{2}$ over $\mathbb{R}^{n}$, where
$$
\begin{aligned}
\Phi(\cdot) &: \quad \mathbb{R}^{n} \rightarrow \mathbb{R}^{+} \cup\{0\} \\
\Phi(x) &=\|b-A x\|_{2}^{2}=(b-A x)^{T}(b-A x)=b^{T} b-b^{T}(A x)-(A x)^{T} b+(A x)^{T}(A x) \\
&=\|b\|_{2}^{2}-2(A x)^{T} b+x^{T} A^{T} A x \\
&=\|b\|_{2}^{2}+2 \Psi(x).
\end{aligned}
$$
Since $\|b\|_{2}^{2}$ is a constant, minimizing $\Phi(x)$ is equivalent to minimizing
$$
\Psi(x)=\frac{1}{2} x^{T} A^{T} A x-x^{T} A^{T} b.
$$
To see what this minimization requires, expand $\Psi$ componentwise, writing $y=Ax$ and $c=A^{T}b$:
$$
\begin{aligned}
\Psi(x) &=\frac{1}{2}(A x)^{T}(A x)-x^{T}\left(A^{T} b\right)=\frac{1}{2} y^{T} y-x^{T} c \\
&=\frac{1}{2} \sum_{i=1}^{m} y_{i}^{2}-\sum_{j=1}^{n} c_{j} x_{j} \\
&=\frac{1}{2} \sum_{i=1}^{m}\left(\sum_{j=1}^{n} A(i, j) x_{j}\right)^{2}-\sum_{j=1}^{n} c_{j} x_{j}, \qquad \text{since } y_{i}=\sum_{j=1}^{n} A(i, j) x_{j}.
\end{aligned}
$$
Thus $\Psi(x)=\Psi\left(x_{1}, x_{2}, \cdots, x_{n}\right)$ is a quadratic function of the unknowns $x_{1}, x_{2}, \cdots, x_{n}$. At a minimum of $\Psi(x)$ the gradient must vanish:
$$
\nabla \Psi(x)=\left[\begin{array}{c} \frac{\partial \Psi}{\partial x_{1}}(x) \\ \frac{\partial \Psi}{\partial x_{2}}(x) \\ \vdots \\ \frac{\partial \Psi}{\partial x_{n}}(x) \end{array}\right]=0
$$
$$
\Longleftrightarrow \quad \frac{\partial \Psi}{\partial x_{k}}(x) =\sum_{i=1}^{m} A(i, k)\left(\sum_{j=1}^{n} A(i, j) x_{j}\right)-c_{k} =\sum_{i=1}^{m} A(i, k)\,(A(i,:)\, x)-c_{k} =\sum_{i=1}^{m} A^{T}(k, i)\, y_{i}-c_{k} =A^{T}(k,:)\, y-c_{k}=0, \quad \text {for } k=1,2, \cdots, n.
$$


$$ \therefore\left\{\begin{array}{l} A^{T}(1,:) y=c_{1} \\ A^{T}(2,:) y=c_{2} \\ \vdots \\ A^{T}(n,:) y=c_{n} \end{array} \quad \Longleftrightarrow \quad A^{T} y=c \quad \Longleftrightarrow \quad A^{T} A x=A^{T} b\right.$$

Conversely, we show that if $x$ solves the normal equations $A^{T} A x=A^{T} b,$ then it solves the least squares problem. For any $y \in \mathbb{R}^{n}$ we have $$ \begin{aligned} \Psi(x)-\Psi(y) &=\frac{1}{2} x^{T} A^{T} A x-x^{T}\left(A^{T} b\right)-\frac{1}{2} y^{T} A^{T} A y+y^{T}\left(A^{T} b\right) \\ &=\frac{1}{2} x^{T} A^{T} A x-x^{T}\left(A^{T} A x\right)-\frac{1}{2} y^{T} A^{T} A y+y^{T}\left(A^{T} A x\right) \\ &=-\frac{1}{2} x^{T} A^{T} A x-\frac{1}{2} y^{T} A^{T} A y+y^{T} A^{T} A x \\ &=-\frac{1}{2}(x-y)^{T} A^{T} A(x-y) \\ &=-\frac{1}{2}\|A(x-y)\|_{2}^{2} \leqslant 0 \\ \therefore \Psi(x) & \leqslant \Psi(y) \quad \text { for any } y \in \mathbb{R}^{n}. \end{aligned} $$ Thus the existence of solutions to the least squares problem reduces to studying the existence of solutions to the system $$ A^{T} A x=A^{T} b, $$ where $G=A^{T} A$ is an $n \times n$ symmetric positive semi-definite matrix and $\operatorname{rank}\left(A^{T} A\right)=\operatorname{rank}(A)=r \leqslant \min (m, n)$.
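The inequality $\Psi(x) \leqslant \Psi(y)$ can be checked numerically. Below is a small sketch with synthetic data: $x$ is obtained from the normal equations, and $\Psi$ is evaluated at random perturbations $y$, none of which does better.

```python
import numpy as np

# Synthetic problem: m = 8 equations, n = 3 unknowns.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Solution of the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Psi(y) = 1/2 y^T A^T A y - y^T A^T b, as in the derivation above.
def psi(y):
    return 0.5 * y @ A.T @ A @ y - y @ A.T @ b

# Psi(x) <= Psi(y) for every y, per Psi(x) - Psi(y) = -1/2 ||A(x-y)||^2.
for _ in range(100):
    y = x + rng.standard_normal(3)
    assert psi(x) <= psi(y) + 1e-12
```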


If you have $Ax = b$, then the least squares solution can be obtained from the normal equations, which are formed by multiplying both sides by the transpose $A^T$. This produces a square matrix on the left.

$$ A^TAx = A^Tb$$

Now the least squares problem is

$$ f(x) = \| b - Ax \|_{2}^{2} = (b-Ax)^T (b-Ax) = b^Tb - x^TA^Tb - b^TAx + x^TA^TAx$$

This comes from how $\| \cdot \|_2$ works. The vector $r = b - Ax$ is the residual, and $\| r \|_{2}^{2} = r^{T}r$, so we are minimizing the squared norm of the residual. Distributing $(b-Ax)^T(b-Ax)$ gives the expression above.

How is the minimum found? If $f$ attains a global minimum at $x$, then $\nabla f(x) = 0$, and using the standard matrix-calculus identities

$$ \nabla (x^TA^Tb) = A^Tb \\ \nabla(b^TAx) = A^Tb \\ \nabla (x^TA^TAx) = 2A^TAx$$

we get $$ \nabla f(x) = 2A^TAx - 2A^Tb.$$

Setting this equal to zero and dividing by 2 recovers exactly the normal equations $A^TAx = A^Tb$.
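Both the gradient formula and the vanishing of the gradient at the solution can be verified numerically. This is a sketch on synthetic data: a central finite difference approximates $\nabla f$ and is compared against $2A^TAx - 2A^Tb$.

```python
import numpy as np

# Synthetic data for the check.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)
x = rng.standard_normal(2)

f = lambda x: (b - A @ x) @ (b - A @ x)   # f(x) = ||b - Ax||_2^2
grad = 2 * A.T @ A @ x - 2 * A.T @ b      # claimed gradient formula

# Central finite-difference approximation of the gradient.
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])
assert np.allclose(grad, num, atol=1e-4)

# At the solution of the normal equations the gradient vanishes.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(2 * A.T @ A @ x_star - 2 * A.T @ b, 0, atol=1e-8)
```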