Understanding the concept of $(X^TX)^{-1}X^Ty$


$w = (X^TX)^{-1}X^Ty$

The equation above produces the weight vector $w$ that solves the quadratic (least-squares) minimization problem for a linear model.

From my understanding, the goal is to find $w$ such that:

$Xw ≈ y$

Thus, I need to minimize this function:

$||Xw-y||^2$

According to Wikipedia, the Euclidean norm is used, and the function is minimized by differentiating it with respect to $w$ (using the multivariable chain rule, from my understanding) and setting the gradient to zero:

$2X^T(Xw - y) = 0$

Dividing by 2 and expanding:

$X^TXw-X^Ty=0$

$X^TXw=X^Ty$

$w=(X^TX)^{-1}(X^Ty)$
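As a sanity check, this closed form can be compared against a library least-squares solver. A quick sketch with synthetic data (the sizes and names here are illustrative, not from the question):

```python
import numpy as np

# A small overdetermined system: 5 samples, 2 features (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)

# Normal-equation solution w = (X^T X)^{-1} X^T y.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Library least-squares solver for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```

(In practice one would solve the normal equations with `np.linalg.solve` or use `lstsq` directly rather than forming the explicit inverse, which is less stable.)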

Somehow, by using the Euclidean norm, I minimized the function, but I can't understand exactly how this works.

Why is the Euclidean norm used to solve quadratic minimization problems for linear functions?


There are 3 answers below.

BEST ANSWER

The best way to explain, in my opinion, is by projection.

Since $Xw = y$ has no exact solution, we look for $Xw = \bar y$, where $\bar y$ is the projection of $y$ onto $Col(X)$.

The error is $e = y - \bar y = y - Xw$, and it is minimized when $e$ is orthogonal to $Col(X)$, that is,

$$X^Te=X^T(y-Xw)=0\implies X^Ty=X^TXw\implies w=(X^TX)^{-1}X^Ty$$
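The orthogonality condition $X^Te = 0$ is easy to verify numerically; a sketch with random data (names and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ w  # residual e = y - Xw

# The residual is orthogonal to every column of X: X^T e = 0.
print(np.allclose(X.T @ e, 0))  # True
```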

ANSWER 2

I'm not sure if this is what you are after, but perhaps worth pointing out:

You can ask the same question for any norm you like, not just the Euclidean one. The calculations are most convenient when the norm comes from an inner product. What you are minimizing is $\|Xw-y\|$, so changing the norm generally changes the minimizer $w$. What changes in the formulas for different norms is that the transpose is replaced by the adjoint with respect to the inner product.

If the inner product on the domain is given by a positive definite matrix $P$ and the one on the target side by a positive definite matrix $Q$, then the minimizer is $w=(X^TQX)^{-1}X^TQy$, which is independent of $P$, since we only use the norm on the target space. The Euclidean case corresponds to $P=I$ and $Q=I$.
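A minimal sketch of this weighted case, assuming a diagonal positive definite $Q$ (all data and names here are invented): the closed form $w=(X^TQX)^{-1}X^TQy$ should achieve a $Q$-weighted error no larger than any perturbed $w$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 2))
y = rng.standard_normal(8)

# Hypothetical positive definite Q on the target space (diagonal weights).
Q = np.diag(rng.uniform(0.5, 2.0, size=8))

# Minimizer of (Xw - y)^T Q (Xw - y):  w = (X^T Q X)^{-1} X^T Q y.
w = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)

def weighted_sq_norm(v):
    """Squared norm induced by the inner product <u, v> = u^T Q v."""
    return v @ Q @ v

# Any perturbation of w gives an equal or larger weighted squared error.
base = weighted_sq_norm(X @ w - y)
for _ in range(100):
    dw = 0.1 * rng.standard_normal(2)
    assert weighted_sq_norm(X @ (w + dw) - y) >= base - 1e-9
print("w minimizes the Q-weighted error")
```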

ANSWER 3

The Euclidean norm is the quadratic norm induced by the scalar product $$\langle x,y\rangle = \sum_{k=1}^n x_k y_k \mbox{ for } x= (x_1 \ldots x_n)^T, \, y=(y_1 \ldots y_n)^T$$ $$ \Rightarrow ||x-y||^2 =\langle x-y,x-y\rangle = \sum_{k=1}^n(x_k-y_k)^2$$

So, it is no surprise that quadratic optimization problems involve the concept of orthogonality.

For example, the best quadratic approximation $y_U$ within a subspace $U$ to a vector $y$ is the orthogonal projection $P_Uy$ of $y$ onto $U$, because for any $x \in U$ we have:

$$ ||x - y||^2 = ||x - P_Uy + P_Uy - y||^2 \stackrel{(P_Uy - y)\perp U}{=}||x - P_Uy||^2 + ||P_Uy - y||^2 \geq ||P_Uy - y||^2$$

So the quadratic minimum is attained for $$y_U = P_Uy$$

In your example you only need to know that $X(X^TX)^{-1}X^T = P_U$ is the orthogonal projector onto the subspace $U$ spanned by the columns of $X$:

$$y_U = P_Uy = X(X^TX)^{-1}X^Ty = Xw \mbox{ where } w= (X^TX)^{-1}X^Ty$$
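The projector properties used above ($P_U^2 = P_U$, $P_U^T = P_U$, and $P_Uy = Xw$) can be checked numerically; a sketch with random data (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((7, 3))
y = rng.standard_normal(7)

# Orthogonal projector onto Col(X):  P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P))  # True: idempotent (projecting twice = once)
print(np.allclose(P.T, P))    # True: symmetric, as orthogonal projectors are

# P y equals X w with w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ w))  # True
```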