Understanding the concept of $(X^TX)^{-1}X^Ty$


$w = (X^TX)^{-1}X^Ty$

The equation above produces the weight vector $w$ that solves the quadratic (least-squares) minimization problem for a linear model.

From my understanding, the goal is to find $w$ such that:

$Xw ≈ y$

Thus, I need to minimize this function:

$||Xw-y||^2$

According to Wikipedia, the Euclidean norm is used, and the function is minimized by differentiating it with respect to $w$ (using the multivariable chain rule, from my understanding) and setting the gradient to zero:

$2X^T(Xw - y) = 0$

Dividing by 2 and expanding:

$X^TXw-X^Ty=0$

$X^TXw=X^Ty$

$w=(X^TX)^{-1}(X^Ty)$
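As a sanity check, this closed form can be compared against a library least-squares solver. A quick sketch with synthetic data (the sizes and names here are illustrative, not from the question):

```python
import numpy as np

# A small overdetermined system: 5 samples, 2 features (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)

# Normal-equation solution w = (X^T X)^{-1} X^T y.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Library least-squares solver for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```

(In practice one would solve the normal equations with `np.linalg.solve` or use `lstsq` directly rather than forming the explicit inverse, which is less stable.)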

Somehow, by using the Euclidean norm, I minimized the function, but I can't understand exactly how this works.

Why is the Euclidean norm used to solve quadratic minimization problems for linear functions?


There are 3 answers below.

BEST ANSWER

The best way to explain, in my opinion, is by projection.

Since $Xw = y$ has no exact solution, we look for $Xw = \bar y$, where $\bar y$ is the projection of $y$ onto $Col(X)$.

The error is $e = y - \bar y = y - Xw$, and it is minimized when $e$ is orthogonal to $Col(X)$, that is,

$$X^Te=X^T(y-Xw)=0\implies X^Ty=X^TXw\implies w=(X^TX)^{-1}X^Ty$$
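The orthogonality condition $X^Te = 0$ is easy to verify numerically; a sketch with random data (names and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ w  # residual e = y - Xw

# The residual is orthogonal to every column of X: X^T e = 0.
print(np.allclose(X.T @ e, 0))  # True
```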

ANSWER 2

I'm not sure if this is what you are after, but perhaps worth pointing out:

You can ask the same question for any norm you like, not just the Euclidean one. The calculations are most convenient when the norm comes from an inner product. What you are minimizing is $\|Xw-y\|$, so changing the norm generally changes the minimizer $w$. What changes in the formulas for different norms is that the transpose is replaced by the adjoint with respect to the inner product.

If the inner product on the domain is given by a positive definite matrix $P$ and the one on the target side by a positive definite matrix $Q$, then the minimizer is $w=(X^TQX)^{-1}X^TQy$, which is independent of $P$, since we only use the norm on the target space. The Euclidean case corresponds to $P=I$ and $Q=I$.
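A minimal sketch of this weighted case, assuming a diagonal positive definite $Q$ (all data and names here are invented): the closed form $w=(X^TQX)^{-1}X^TQy$ should achieve a $Q$-weighted error no larger than any perturbed $w$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 2))
y = rng.standard_normal(8)

# Hypothetical positive definite Q on the target space (diagonal weights).
Q = np.diag(rng.uniform(0.5, 2.0, size=8))

# Minimizer of (Xw - y)^T Q (Xw - y):  w = (X^T Q X)^{-1} X^T Q y.
w = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)

def weighted_sq_norm(v):
    """Squared norm induced by the inner product <u, v> = u^T Q v."""
    return v @ Q @ v

# Any perturbation of w gives an equal or larger weighted squared error.
base = weighted_sq_norm(X @ w - y)
for _ in range(100):
    dw = 0.1 * rng.standard_normal(2)
    assert weighted_sq_norm(X @ (w + dw) - y) >= base - 1e-9
print("w minimizes the Q-weighted error")
```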

ANSWER 3

The Euclidean norm is the quadratic norm induced by the scalar product $$\langle x,y\rangle = \sum_{k=1}^n x_k y_k \mbox{ for } x= (x_1 \ldots x_n)^T, \, y=(y_1 \ldots y_n)^T$$ $$ \Rightarrow ||x-y||^2 =\langle x-y,x-y\rangle = \sum_{k=1}^n(x_k-y_k)^2$$

So, it is no surprise that quadratic optimization problems involve the concept of orthogonality.

For example, the best quadratic approximation $y_U$ within a subspace $U$ to a vector $y$ is the orthogonal projection $P_Uy$ of $y$ onto $U$, because for any $x \in U$ we have:

$$ ||x - y||^2 = ||x - P_Uy + P_Uy - y||^2 \stackrel{(P_Uy - y)\perp U}{=}||x - P_Uy||^2 + ||P_Uy - y||^2 \geq ||P_Uy - y||^2$$

So the quadratic minimum is attained for $$y_U = P_Uy$$

In your example you only need to know that $X(X^TX)^{-1}X^T = P_U$ is the orthogonal projector onto the subspace $U$ spanned by the columns of $X$:

$$y_U = P_Uy = X(X^TX)^{-1}X^Ty = Xw \mbox{ where } w= (X^TX)^{-1}X^Ty$$
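The projector properties used above ($P_U^2 = P_U$, $P_U^T = P_U$, and $P_Uy = Xw$) can be checked numerically; a sketch with random data (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((7, 3))
y = rng.standard_normal(7)

# Orthogonal projector onto Col(X):  P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P))  # True: idempotent (projecting twice = once)
print(np.allclose(P.T, P))    # True: symmetric, as orthogonal projectors are

# P y equals X w with w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ w))  # True
```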