I've been reading through the Wikipedia article on degrees of freedom (statistics). There is a section about residuals, in relation to least squares estimation. The article says:
Suppose you have some model $Y_i=a+bx_i + \epsilon_i \text{ for } i=1,...,n$.
Let $\hat a$ and $\hat b$ be least squares estimators of $a$ and $b$.
We can compute the residuals as follows: $\hat e_i=y_i-(\hat a + \hat b x_i)$.
The article then says that these residuals are constrained to lie within the space defined by:
$\hat e_1 + \dots + \hat e_n=0$ and $x_1 \hat e_1 + \dots + x_n \hat e_n=0$.
Hence, they say there are $n-2$ degrees of freedom for error.
So, my first question is, where have these two constraints come from?
I guess the first one comes from the fact that the mean of the residuals is supposed to be $0$. The second one, I am not sure about.
I suppose when they say there are $n-2$ degrees of freedom for error, it means the residuals are constrained to lie within an ($n-2$)-dimensional subspace? Hence, my second question is, why do these constraints mean that the residuals are constrained to an ($n-2$)-dimensional subspace?
When you form a design matrix $Z$ (with $n$ rows and $p$ columns, assumed to have full column rank), the column space of $Z$ is a $p$-dimensional subspace of $\mathbb{R}^n$. The residual vector -- the difference between the response vector and its least squares projection onto the column space of $Z$ -- must be perpendicular to the column space of $Z$. So the residual vector is constrained to lie in the left nullspace of $Z$, whose dimension is $n-p$; hence the residual vector has $n-p$ degrees of freedom.
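A quick numerical illustration of this geometry (a sketch with made-up data; `numpy.linalg.lstsq` is one standard way to compute the least squares fit):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Design matrix Z: an intercept column plus one covariate, so p = 2.
x = rng.normal(size=n)
Z = np.column_stack([np.ones(n), x])

# Response generated from the model, then the least squares fit.
y = 1.0 + 2.0 * x + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)

# The residual vector is perpendicular to every column of Z,
# i.e. Z^T e = 0, so e lies in the left nullspace of Z.
e = y - Z @ beta_hat
print(np.allclose(Z.T @ e, 0))  # True (up to floating point error)
```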
In the example from the question, $p=2$, so there are $n-2$ degrees of freedom for error. Furthermore, as stated, the residual vector is perpendicular to the column space -- in other words, perpendicular to each column of $Z$. Hence $\hat e^T z_j$ must equal zero for every column $z_j$ of $Z$, and each of these dot products yields one constraint. The first column of $Z$ is all $1$s, so the first constraint says the residuals sum to $0$. Each column after that contains the values of a covariate, so each remaining constraint says the residuals are orthogonal to that covariate, e.g. $x_1 \hat e_1 + \dots + x_n \hat e_n = 0$.
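For the simple regression in the question, the two dot products are exactly the two stated constraints. A minimal check, using small invented data:

```python
import numpy as np

# Toy data for the model y_i = a + b*x_i + noise.
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
y = np.array([1.2, 2.1, 2.8, 3.9, 5.5])

# Least squares estimates of a and b via the design matrix [1, x].
Z = np.column_stack([np.ones_like(x), x])
a_hat, b_hat = np.linalg.lstsq(Z, y, rcond=None)[0]

# The two constraints from the question hold up to floating point error:
e = y - (a_hat + b_hat * x)
print(np.isclose(e.sum(), 0.0))       # e_1 + ... + e_n = 0
print(np.isclose(np.dot(x, e), 0.0))  # x_1*e_1 + ... + x_n*e_n = 0
```

Both checks print `True`: with $n=5$ residuals and these two constraints, the residual vector is free to vary only in a $3$-dimensional subspace.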