So I am currently re-studying linear regression and wanted to explore the topic more thoroughly than before.
The prediction formula for linear regression is:
$$\hat Y = X \hat \theta $$
Now, the normal equation is:
$$ \hat \theta= (X^TX)^{-1}X^T Y$$
So far everything is good, but while trying to see what happens when I linearly transform $X_2=AX$, I noticed that both predictors are actually equal to the data! In fact, after plugging the normal equation into the prediction equation:
$$\hat Y = X \hat\theta = X (X^TX)^{-1}X^T Y = X X^{-1} X^{-T} X^T Y = I I Y = Y$$
Which would mean the prediction is exactly equal to the data. Now of course this is not the case and of course I am missing something, but I cannot really find a hole here.
The issue is that in linear regression the design matrix $X$ is generally not square: it has many more rows than columns, i.e. many more data points than model parameters. A non-square matrix has no inverse, so the step $(X^TX)^{-1} = X^{-1}X^{-T}$ is invalid; that identity only holds when $X$ itself is square and invertible. Note that this is not a rank problem: for the normal equation to have a unique solution, $X^TX$ must be invertible, which requires $X$ to have full *column* rank, and a tall matrix can have full column rank without having an inverse.
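A minimal NumPy sketch (with a synthetic dataset, so all names here are illustrative) makes this concrete: for a tall $X$, the "hat matrix" $H = X(X^TX)^{-1}X^T$ from the collapsed derivation is a low-rank projection, not the identity, and the fitted values differ from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall design matrix: 10 data points, 2 columns (intercept + one feature).
X = np.column_stack([np.ones(10), rng.normal(size=10)])
Y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=10)  # noisy data

# Normal-equation estimate: solve (X^T X) theta = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ theta_hat

# H = X (X^T X)^{-1} X^T is a rank-2 orthogonal projection, not I.
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.round(np.trace(H)))          # 2.0 -> rank of the projection
print(np.allclose(H, np.eye(10)))     # False: H is not the identity
print(np.allclose(Y_hat, Y))          # False: predictions != data
```

The trace of $H$ equals the number of columns of $X$, which is exactly the sense in which the regression "uses up" only two degrees of freedom out of ten.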
Linear regression with $n$ variables will only be exact if $X$ is square and invertible, that is, if I'm fitting the model with exactly $n+1$ data points (one per parameter, counting the intercept). In that special case $(X^TX)^{-1}X^T$ really does reduce to $X^{-1}$ and the fit interpolates the data. But if I have more than $n+1$ data points with some degree of scatter, $X$ has no inverse, and the fitted model instead minimizes the sum of squared errors.
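The square case is easy to check numerically as well; here is a sketch with one feature plus an intercept ($n+1 = 2$ parameters) fit to exactly two points, where the collapsed derivation from the question is actually valid:

```python
import numpy as np

# Two parameters (intercept + slope), fit with exactly two points:
# X is square and invertible, so (X^T X)^{-1} X^T = X^{-1}.
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
Y = np.array([3.0, 5.0])

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(X @ theta_hat, Y))   # True: the line passes through both points
```

Add a third, non-collinear point and the same code would return `False`: the system becomes overdetermined and the normal equation yields the least-squares compromise instead.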