The partial derivatives of $|Ax|^2$ with respect to $x_1, ... , x_n$ fill the vector $2A^T Ax$. The derivatives of $2b^T Ax$ fill the vector $2A^T b$. So the derivatives of $|Ax-b|^2$ are zero when __________________
From least squares fitting I understand that the derivatives of $|Ax-b|^2$ are zero when $A^TAx=A^Tb$, the idea being to minimise the error. But how do I make sense of the first two statements in the given question:
"The partial derivatives of $|Ax|^2$ with respect to $x_1, \dots, x_n$ fill the vector $2A^T Ax$" and "The derivatives of $2b^T Ax$ fill the vector $2A^T b$"?
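Both statements can be checked numerically by comparing the claimed gradient vectors against finite differences. This is a sketch using NumPy; the matrix $A$ and the vectors $b$, $x$ are arbitrary random examples, and `num_grad` is a helper defined here for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # arbitrary 5x3 example matrix
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def num_grad(f, x, h=1e-6):
    """Central-difference approximation to the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# the partial derivatives of |Ax|^2 fill the vector 2 A^T A x
g1 = num_grad(lambda v: (A @ v) @ (A @ v), x)
assert np.allclose(g1, 2 * A.T @ A @ x, atol=1e-4)

# the derivatives of 2 b^T A x fill the (constant) vector 2 A^T b
g2 = num_grad(lambda v: 2 * b @ (A @ v), x)
assert np.allclose(g2, 2 * A.T @ b, atol=1e-4)
```

Note that the second gradient is constant in $x$, since $2b^TAx$ is linear in $x$.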
$$ |Ax-b|^2=(Ax-b)^T(Ax-b)=(x^TA^T-b^T)(Ax-b)\\=x^TA^TAx-x^TA^Tb-b^TAx+b^Tb\\ =|Ax|^2-(b^TAx)^T-b^TAx+|b|^2 $$
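The expansion can be verified numerically. In this sketch $A$, $x$, $b$ are arbitrary random examples; the two middle terms are scalars and transposes of each other, so together they contribute $-2b^TAx$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2))
x = rng.standard_normal(2)
b = rng.standard_normal(4)

lhs = np.sum((A @ x - b) ** 2)        # |Ax - b|^2
rhs = (np.sum((A @ x) ** 2)           # |Ax|^2
       - 2 * b @ (A @ x)              # -(b^T A x)^T - b^T A x
       + np.sum(b ** 2))              # +|b|^2
assert np.isclose(lhs, rhs)
```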
Reference: Problem 24, Section 4.3, Chapter 4: Orthogonality, *Introduction to Linear Algebra* by Gilbert Strang.
Basically, the problem here is to minimize the function $f = \| Ax -b\|^2$, which can be written as
$$ \min_{x} \| Ax -b\|^2 $$
The minimum of $f$ is attained at $\hat{x}$, the point at which the gradient (the first derivative) of $f$ with respect to $x$ equals zero, that is:
$$ \frac{\partial \| Ax -b\|^2}{\partial x} = 0$$
For simplicity, we use the Frobenius product notation $A:B = \operatorname{tr}(A^TB)$, which for vectors reduces to the ordinary dot product (so $y:y = \|y\|^2$), and then we differentiate with respect to $x$:
\begin{equation} \begin{split} y & = Ax-b \\ dy & = Adx \\ f & = \|y\|^2 \\ & = y:y \\ df & = dy:y + y:dy \\ & = 2y:dy \\ & = 2y:Adx \\ & = 2A^Ty:dx \\ \frac{df}{dx} & = 2A^Ty\\ & = 2A^T(Ax -b)\\ \end{split} \end{equation}
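The resulting gradient $2A^T(Ax-b)$ can be sanity-checked against a finite-difference gradient of $f$ itself. This is a sketch with arbitrary random data, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)

f = lambda v: np.sum((A @ v - b) ** 2)  # f = |Av - b|^2

# central-difference gradient of f at x
h = 1e-6
g = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
              for e in np.eye(3)])

# matches the closed form df/dx = 2 A^T (Ax - b)
assert np.allclose(g, 2 * A.T @ (A @ x - b), atol=1e-4)
```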
Now, we set the derivative equal to zero
\begin{equation} \begin{split} \frac{df}{dx} & = 0 \\ 2A^T(Ax -b) & = 0\\ 2A^TAx - 2A^Tb & = 0 \\ \implies 2A^TAx & = 2A^Tb \\ \implies \hat{x} & = (A^TA)^{-1}A^Tb \\ \end{split} \end{equation}
where the last step assumes $A$ has full column rank, so that $A^TA$ is invertible.
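The normal-equations solution can be confirmed against NumPy's least-squares solver. This is a sketch with an arbitrary random tall matrix (full column rank with probability 1); in practice one would solve the normal equations rather than form the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# normal-equations solution x_hat = (A^T A)^{-1} A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# agrees with the built-in least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, x_lstsq)

# and the residual Ax_hat - b is orthogonal to the columns of A
assert np.allclose(A.T @ (A @ x_hat - b), 0)
```

The orthogonality of the residual to the column space of $A$ is exactly what the chapter's title, Orthogonality, refers to.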