In linear regression, $$y = X \beta + \epsilon$$ where $y$ is an $n \times 1$ vector of observations of the response variable,
$X$ is the $n \times p$ data matrix of $p$ explanatory variables whose $i$-th row is $x_{i}^{T}$, with $x_{i} \in \mathbb{R}^p$ for $i = 1, \dots, n$, and $\epsilon$ is an $n \times 1$ vector of errors.
Further, assume that $\mathbb{E}[\epsilon_i] = 0$ and $\mathrm{var}(\epsilon_i) = \sigma^2$ for $i = 1, \dots, n$.
The least-squares estimate is $$\hat{\beta} = (X^{T}X)^{-1}X^{T}y.$$
The fitted values are $$\hat{y} = X \hat{\beta} = X(X^{T}X)^{-1}X^{T}y = X C^{-1}X^{T}y = Py,$$ where $C = X^{T}X$ and $P = XC^{-1}X^{T}$.
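To make the formulas above concrete, here is a minimal numerical sketch (the synthetic data and variable names are my own, not from the post):

```python
import numpy as np

# Small synthetic regression problem (illustrative assumption: n=50, p=3).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T y,
# solved via the normal equations rather than an explicit inverse.
C = X.T @ X
beta_hat = np.linalg.solve(C, X.T @ y)

# Fitted values via the projection matrix P = X C^{-1} X^T.
P = X @ np.linalg.solve(C, X.T)
y_hat = P @ y

# The two routes to the fitted values agree.
assert np.allclose(y_hat, X @ beta_hat)
```

Note that `np.linalg.solve` is used instead of forming $(X^TX)^{-1}$ explicitly, which is the standard numerically stable choice.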
$P$ is a projection matrix. It has the following properties:
- idempotent, meaning $P^2 = PP = P$
- symmetric
- positive semi-definite
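The three properties are easy to verify numerically; a quick sketch with a random $X$ (my own example, not from the post):

```python
import numpy as np

# Build a projection matrix from a random full-column-rank X.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
P = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(P @ P, P)        # property 1: idempotent, P^2 = P
assert np.allclose(P, P.T)          # property 2: symmetric

# Property 3: all eigenvalues are >= 0 (for a projection they are exactly 0 or 1).
eigvals = np.linalg.eigvalsh(P)
assert np.all(eigvals >= -1e-10)
```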
For property 1, what's the intuition? How can taking a matrix and applying transposition, inversion, and multiplication produce something idempotent? It's an important concept, but it's hard to build intuition just by following through the math.
Why do we get properties 2 and 3? How am I supposed to think about these?
Basically, when you have $n$ observations with $p$ unknown coefficients and $n > p$, you have an over-determined system of equations, $$ X\beta = y. $$ In general, every subset of $p$ equations will give you a different $\hat{\beta}$. Hence, you cannot simply solve this system of equations; rather, you have to find an approximate solution. The fitted vector $X\hat{\beta}$ lives in the column space of $X$ (since $Ax$ is, by definition, a linear combination of the columns of $A$). Now, you are interested in the "best" solution, namely a $\hat{\beta}$ that solves a modified system of linear equations: instead of solving $X\beta = y$ you solve the normal equations $X'X\beta = X'y$, which have a unique solution for $\beta$ when $X$ has full column rank. Instead of the original $y$ you then get $\hat{y}$, a vector that belongs to $C(X)$ and is the closest possible vector in $C(X)$ to the original $y$.
Why is it the best? A matrix $P = A(A'A)^{-1}A'$ is a projection matrix onto the column space of $A$ (why it has this specific form you can read in the link given in the comments). Why does $P^2 = P$? Let us see what $X(X'X)^{-1}X'$ does to a vector $x \in C(X)$. If $x$ is already in the column space of $X$, then "projecting" it onto $C(X)$ does nothing, i.e., it returns $x$ itself. $P^2 = PP$ is, in a sense, projecting a set of vectors that already lie in $C(X)$ onto $C(X)$ again, hence you should get $P$ itself.
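This intuition can be checked directly: projecting a vector that already lies in $C(X)$ returns it unchanged, and projecting twice equals projecting once. A minimal sketch (random data, my own variable names):

```python
import numpy as np

# Random design matrix and its projection matrix onto C(X).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
P = X @ np.linalg.solve(X.T @ X, X.T)

# A vector already in C(X): any linear combination of X's columns.
x_in = X @ np.array([1.0, -2.0, 0.5])
assert np.allclose(P @ x_in, x_in)      # projection leaves it unchanged

# For an arbitrary y, projecting twice is the same as projecting once: P(Py) = Py.
y = rng.normal(size=30)
assert np.allclose(P @ (P @ y), P @ y)
```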