I have attempted to explain why least squares regression is "legal" in several ways. It is the outcome of maximum likelihood estimation (MLE) if the errors are normal random variables.
I have also found that MLE is closely related to orthogonal projections (the square is related to the modulus!). However, I have forgotten how to explain least squares regression in terms of orthogonal projections. Can anyone explain that?
Also, are there other ways to explain why least squares is appropriate?
The regression problem is basically the problem of finding the "best" solution of an over-determined system. I.e., you have $n$ equations in $p+1$ coefficients, where the sample size $n$ is much larger than the number of explanatory variables. Each choice of $p+1$ of the equations would give a different set of $\beta$s, so instead of finding the coefficients from only $p+1$ of the observations, you are trying to find the coefficients that come "closest" for all the observations at once. I.e., instead of expressing the dependent variable $y$ exactly in terms of the columns of $X$ (denoted $C(X)$), you are searching for the vector closest to $y$ in $\operatorname{sp}\{C(X)\}$, which is the vector of OLS fitted values. As such, the "closest" vector to $y$ in $\operatorname{sp}\{C(X)\}$ is the orthogonal projection of $y$ onto $C(X)$.
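Here is a minimal numerical sketch of this point, assuming NumPy and a simulated design matrix of my own choosing (not from the question): the OLS fit leaves a residual $y - \hat{y}$ that is orthogonal to every column of $X$, which is the defining property of an orthogonal projection onto $C(X)$.

```python
import numpy as np

# Over-determined system: n = 50 observations, p = 2 explanatory variables plus intercept.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix, n x (p+1)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)           # y does not lie exactly in C(X)

# OLS coefficients: the beta whose fitted values are closest to y within C(X)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# The residual y - y_hat is orthogonal to every column of X.
print(np.allclose(X.T @ (y - y_hat), 0.0))  # True
```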
Recall that the OLS (estimated) coefficients are $$ \hat{\beta} = (X^TX)^{-1}X^Ty $$ (assuming $X$ has full column rank, so that $X^TX$ is invertible). Thus, the fitted values, denoted by $\hat{y}$, are given by $$ \hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy, $$ where $H$, called the "hat matrix", is the orthogonal projection onto $C(X)$. Since the columns of $X$ span the subspace $C(X)$, the OLS coefficients are the coordinates of the projected $y$ w.r.t. the columns of $X$.
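As a quick check (again a sketch assuming NumPy and a simulated full-column-rank $X$, not data from the question), one can build $H$ explicitly and verify the two properties that characterize an orthogonal projection matrix: symmetry ($H = H^T$) and idempotence ($H^2 = H$).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # full column rank design
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: projects onto C(X)
y_hat = H @ y                          # fitted values = projection of y onto C(X)

print(np.allclose(H, H.T))    # symmetric
print(np.allclose(H @ H, H))  # idempotent: projecting twice changes nothing
print(np.allclose(y_hat, X @ np.linalg.solve(X.T @ X, X.T @ y)))  # equals X @ beta_hat
```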