Linear regression prediction equal to data when using normal equation?


So I am currently re-studying linear regression and wanted to explore the topic more thoroughly than before.

The prediction formula for linear regression is: $$\hat Y = X \hat \theta $$
Now, the normal equation is:
$$ \hat \theta= (X^TX)^{-1}X^T Y$$ So far so good, but while exploring what happens when I linearly transform $X_2=AX$, I noticed that both predictors are actually equal to the data! Indeed, after plugging the normal equation into the prediction equation:
$$\hat Y = X \hat\theta = X (X^TX)^{-1}X^T Y = X X^{-1} X^{-T} X^T Y = I I Y = Y$$
Which would mean the prediction is exactly equal to the data. Now of course this is not the case and of course I am missing something, but I cannot really find a hole here.


There are 2 best solutions below

BEST ANSWER

The issue is that in linear regression the coefficient matrix $X$ is in general not square: it has many more rows than columns, because the model parameters are far fewer in number than the data points used to fit the model. A least-squares solution is needed precisely because the overdetermined system $X\theta = Y$ generally has no exact solution. So the simplification $XX^{-1} = I$ is invalid: a rectangular $X$ has no inverse.

Linear regression with $n$ variables will only fit the data exactly if $X$ is square and invertible, that is, if the model is fitted with exactly $n+1$ data points (the $+1$ accounting for the intercept). With more than $n+1$ data points that have some degree of scatter, $X$ is a tall rectangular matrix with no inverse, and the fitted model instead minimizes the sum of squared errors.
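A quick numerical check of both regimes (a sketch using NumPy; the matrices and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n = 20 data points, p = 3 parameters.
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Normal-equation estimate: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta

# With scatter in y, the fit is not exact: residuals remain.
print(np.allclose(y_hat, y))            # False

# Square case: n = p, X invertible -> the fit interpolates the data.
Xs = rng.normal(size=(p, p))
ys = rng.normal(size=p)
theta_s = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(np.allclose(Xs @ theta_s, ys))    # True
```

With noisy data, $\mathbf y$ is (almost surely) not in the column space of $X$, so the residual cannot vanish; with a square invertible $X$, the model has enough freedom to interpolate every point.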

ANSWER

In linear regression, the standard setup is that we have $n$ observations $(\mathbf x_i,y_i)$, where $\mathbf x_i \in\mathbb R^p$ is a vector of covariates (features) and $y_i\in\mathbb R$ is the "response" or "output" corresponding to $\mathbf x_i$. Furthermore, we assume that the following relationship holds for all $i$: $$y_i = \mathbf x_i^T\boldsymbol\theta^* +\varepsilon_i, $$ where $\varepsilon_1,\ldots,\varepsilon_n$ are (typically) i.i.d. $\mathcal N(0,\sigma^2)$ distributed and $\boldsymbol\theta^*\in\mathbb R^p$ is the vector of parameters we want to estimate. This estimation is typically done by solving $\min_{\boldsymbol \theta}\sum_{i=1}^n (y_i - \mathbf x_i^T\boldsymbol\theta)^2$, which can be equivalently rewritten in matrix form as $$\min_{\boldsymbol \theta}\|\mathbf y - \mathbf X\boldsymbol\theta\|^2 \tag1$$

where $\mathbf y:=[y_1\ \ldots\ y_n]^T\in\mathbb R^n$ and $\mathbf X := [\mathbf x_1\ \ldots\ \mathbf x_n]^T\in\mathbb R^{n\times p}$ (the rows of $\mathbf X$ are the $\mathbf x_i^T$). Provided that $\mathbf X$ has full column rank, equation $(1)$ leads to the normal equations and the solution $\hat{\boldsymbol\theta}$ you've given in your question.

So what is wrong with your computation? It is that when you wrote "$(\mathbf X^T\mathbf X)^{-1} = \mathbf X^{-1}\mathbf X^{-T}$", you implicitly assumed that $\mathbf X$ is invertible, but unless $n=p$, that is not the case in general! In fact, in classical linear regression, we usually assume an overdetermined system, i.e. $n>p$.
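To see this concretely (a NumPy sketch with arbitrary dimensions): for a tall $\mathbf X$, the product $\mathbf X^T\mathbf X$ is a small $p\times p$ matrix that is invertible whenever $\mathbf X$ has full column rank, but $\mathbf X$ itself has no inverse at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4                       # overdetermined: n > p
X = rng.normal(size=(n, p))

# X^T X is p x p and invertible (X has full column rank a.s.),
# so the normal-equation solution exists...
print((X.T @ X).shape)             # (4, 4)

# ...but X itself is not square, so X^{-1} does not exist,
# and (X^T X)^{-1} cannot be split into X^{-1} X^{-T}.
try:
    np.linalg.inv(X)
except np.linalg.LinAlgError as e:
    print("inv(X) failed:", e)
```

The identity $(AB)^{-1} = B^{-1}A^{-1}$ requires both factors to be invertible, which is exactly what fails here.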


As a side note, when $\mathbf X$ is invertible it is not surprising that the fit is exact: you have a system of $n$ equations in $n$ unknowns, so $\hat{\boldsymbol\theta}=\mathbf X^{-1}\mathbf y$ solves it exactly. That being said, the $\hat{\mathbf y}$ you get is not the "true" value which generated the observations but rather that value offset by the measurement error $\boldsymbol \varepsilon$.
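A short illustration of this side note (again a NumPy sketch with made-up coefficients): with $n=p$ the fit reproduces $\mathbf y$ exactly, but the estimated coefficients absorb the noise and differ from the true $\boldsymbol\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
theta_true = np.array([2.0, -1.0, 0.5])

X = rng.normal(size=(p, p))            # square, invertible a.s.
eps = 0.1 * rng.normal(size=p)         # measurement noise
y = X @ theta_true + eps               # observed responses

theta_hat = np.linalg.solve(X, y)      # exact solve since X is invertible

print(np.allclose(X @ theta_hat, y))       # True: prediction equals the data
print(np.allclose(theta_hat, theta_true))  # False: estimate absorbed the noise
```

The fit interpolates the noise along with the signal, which is why an exact fit is not the same thing as recovering $\boldsymbol\theta^*$.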