Conceptual similarity between the normal equation in linear regression (for finding the best parameters) and an orthogonal projection in linear algebra?


So I'm taking linear algebra and a machine learning course at the same time, and I noticed that the orthogonal projection formula from linear algebra, where $A$ is a matrix and $b$ is a vector:

$$(A^TA)^{-1}A^Tb$$

seems very similar to the normal equation in linear regression, where the formula for the parameters that minimize the cost function is:

$$(X^TX)^{-1}X^Ty $$

I just found it interesting and was wondering if anyone can explain to me the conceptual foundation for the similarities between them and why they would be similar.
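For what it's worth, here is a quick numerical sketch (made-up data, using NumPy) showing that the normal-equation formula agrees with what a standard least-squares solver returns:

```python
import numpy as np

# Made-up data: 50 observations, an intercept column plus two features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

# Normal equation: (X^T X)^{-1} X^T y
beta_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))  # True
```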

Best answer:

Start with a linear model in two explanatory variables:

$$ y_i = a + bx_{1,i} + cx_{2,i} + \text{“error”}_i $$

In ordinary least squares one seeks the values of $\widehat a,\widehat{b\,}, \widehat{c\,}$ that, when put in the roles of $a,b,c,$ minimize the sum of squares of the residuals $\widehat{y\,}_i - y_i,$ where the fitted values $\widehat{y\,}_i$ are given by $$\widehat{y\,}_i = \widehat a + \widehat{b\,} x_{1,i} + \widehat{c\,}x_{2,i}. \tag 1$$
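As a sanity check of the "minimizes the sum of squares" claim, here is a small sketch (same kind of made-up data as in the question's snippet): any perturbation of the fitted coefficients increases the residual sum of squares.

```python
import numpy as np

# Made-up data, as before.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

def rss(beta):
    """Residual sum of squares for a candidate coefficient vector."""
    r = y - X @ beta
    return r @ r

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(rss(beta_hat))                              # the minimal value
print(rss(beta_hat + 0.01 * rng.normal(size=3)))  # any perturbation is larger
```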

Since the vector $\widehat{\mathbf y} = \big( \widehat{y\,}_1, \ldots, \widehat{y\,}_n \big)^\top$ is thus closer to $\mathbf y = \big(y_1,\ldots,y_n\big)^\top$ than is any other vector whose components can be expressed by $(1),$ $\widehat{\mathbf y}$ is therefore the orthogonal projection of $\mathbf y$ onto the space spanned by $\big(1,\ldots,1\big)^\top$ (which will be multiplied by $\widehat a\,$), $\big(x_{1,1},\ldots,x_{1,n}\big)^\top,$ and $\big(x_{2,1},\ldots,x_{2,n}\big)^\top.$
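The projection picture can also be checked numerically (again just a sketch with made-up data): the hat matrix $X(X^\top X)^{-1}X^\top$ is idempotent, and the residual $\mathbf y - \widehat{\mathbf y}$ is orthogonal to every column of $X,$ i.e. to the spanning vectors listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

# Hat matrix: orthogonal projection onto the column space of X.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print(np.allclose(H @ H, H))               # idempotent, as a projection must be
print(np.allclose(X.T @ (y - y_hat), 0))   # residual is orthogonal to each column of X
```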

The "design matrix" (something of a misnomer) $X$ is the matrix whose columns are those vectors that span the space onto which $\mathbf y$ gets projected. Thus we have $$ X\mathbf{\widehat a} = \widehat{y\,} = X(X^\top X)^{-1}X^\top \mathbf y. $$ One may be tempted to multiply both sides of $X\mathbf{\widehat a} = X(X^\top X)^{-1}X^\top \mathbf y$ on the left by $X^{-1},$ but since $X$ is a tall skinny matrix (i.e. has many more rows that columns) it doesn't have an inverse of the kind first considered in linear algebra courses. If the columns of $X$ are linearly independent (as is typical in these problems) $X$ does, however, have a left inverse, which is $(X^\top X)^{-1}X^\top.$ Multiply both sides of that equality on the left by that matrix and you get $$ \widehat{\mathbf a} = (X^\top X)^{-1}X^\top\mathbf y. $$