Why is the denominator $N-p-1$ in estimation of variance?


I was recently going through the book *The Elements of Statistical Learning* by Tibshirani et al. While explaining the ordinary least squares model, the authors assume that $y_i \in \mathbb{R}$ are the observed outputs, $\hat{y}_i$ the model outputs, and $\mathbf{x}_i \in \mathbb{R}^{p+1}$ the inputs. If the $y_i$s are assumed to be uncorrelated and to have constant variance $\sigma^2$, then the unbiased estimate of the variance is $\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}\left( y_i - \hat{y}_i \right)^2$. Note that $p$ has been used here to denote the number of inputs, so that $\mathbf{x}_i \in \mathbb{R}^{p+1}$ once an intercept is included. My question is: why is the factor in the denominator $N-p-1$ when estimating the variance $\hat{\sigma}^2$ of the $y_i$s? From my understanding, if the $y_i$s are real numbers with constant variance, the factor should be $N-1$.

2 Answers

BEST ANSWER

You can show that $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$. Since the expectation of a $\chi^2_{N-p-1}$ random variable is $N-p-1$, it follows that $\mathbb{E}\left(\frac{1}{N-p-1}\sum(y_i-\hat{y}_i)^2\right)=\sigma^2$.

$N-p-1$ is in the denominator to make the estimator unbiased.
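This is easy to check numerically. Below is a quick Monte Carlo sketch (the simulation setup, dimensions, and variable names are my own, not from the answer): with Gaussian errors, the residual sum of squares averages to $\sigma^2(N-p-1)$, so dividing by $N-p-1$ recovers $\sigma^2$.

```python
import numpy as np

# Monte Carlo check: with Gaussian errors, the residual sum of squares
# RSS = sum (y_i - yhat_i)^2 should average to sigma^2 * (N - p - 1).
rng = np.random.default_rng(0)
N, p, sigma = 50, 3, 2.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
beta = rng.normal(size=p + 1)

rss = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    rss.append(np.sum((y - X @ beta_hat) ** 2))

# Dividing the average RSS by N - p - 1 should land near sigma^2 = 4,
# while dividing by N - 1 would be biased downward.
print(np.mean(rss) / (N - p - 1))
```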


The current accepted answer is flawed, as it implicitly assumes that the error of the model $\varepsilon$ is Gaussian (otherwise you need not have $\sum(y_i-\hat{y}_i)^2\sim\sigma^2\chi^2_{N-p-1}$).

Here's a proof with the general assumption that $\varepsilon$ has mean $0$ and variance $\sigma^2 I_N$.

First note that $\sum(y_i-\hat{y}_i)^2=\|y-X\hat\beta\|^2$.

We have $$\begin{align} y-X\hat\beta &= X\beta +\varepsilon -X(X^TX)^{-1}X^T(X\beta +\varepsilon)\\ &=X\beta +\varepsilon - X\beta -X(X^TX)^{-1}X^T\varepsilon\\ &= (I_N-H)\varepsilon\end{align}$$ where $H=X(X^TX)^{-1}X^T$ is the hat matrix. It's easy to check that $H^T=H$ and $H^2=H$ (indeed the hat matrix is merely the orthogonal projection on $\operatorname{Im}X$).

Hence $\begin{aligned}[t]E( \|y-X\hat\beta\|^2) &= E(\varepsilon^T(I_N-H)^T (I_N-H)\varepsilon)=E(\varepsilon^T(I_N-H)\varepsilon) \end{aligned}$

Note that $\varepsilon^T(I_N-H)\varepsilon=\sum_{i,j} \varepsilon_i\varepsilon_j (\delta_{ij}-H_{ij})$, thus $$E(\varepsilon^T(I_N-H)\varepsilon)=\sum_{i,j} \sigma^2\delta_{ij} (\delta_{ij}-H_{ij})=\sigma^2(N-\operatorname{tr}H)$$

Note that (assuming $X \in \mathbb{R}^{N\times(p+1)}$ has full column rank, so that $X^TX$ is invertible) $\operatorname{tr}H =\operatorname{tr}(X(X^TX)^{-1}X^T)=\operatorname{tr}(X^TX(X^TX)^{-1})=\operatorname{tr}(I_{p+1})=p+1 $

Putting everything together, $E( \|y-X\hat\beta\|^2)=\sigma^2(N-p-1)$, so $E\left(\frac{1}{N-p-1}\|y-X\hat\beta\|^2\right)=\sigma^2$: dividing by $N-p-1$ gives an unbiased estimator without any Gaussian assumption.
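The general result can also be checked numerically. Here is a sketch (the uniform error distribution is an arbitrary non-Gaussian choice with mean $0$ and variance $1$; all names and dimensions are mine): it verifies $\operatorname{tr}H = p+1$ and that the average of $\|y-X\hat\beta\|^2 = \|(I_N-H)\varepsilon\|^2$ is close to $\sigma^2(N-p-1)$ even without Gaussian errors.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 40, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p inputs
beta = np.arange(p + 1, dtype=float)

# Hat matrix H = X (X^T X)^{-1} X^T; its trace should equal p + 1.
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.trace(H))  # ~ p + 1 = 5

# Uniform(-sqrt(3), sqrt(3)) errors: mean 0, variance sigma2 = 1, non-Gaussian.
half, sigma2 = np.sqrt(3.0), 1.0
rss = []
for _ in range(20000):
    eps = rng.uniform(-half, half, size=N)
    y = X @ beta + eps
    # Residuals are (I - H) y, and since (I - H) X beta = 0 they equal (I - H) eps.
    rss.append(np.sum((y - H @ y) ** 2))

# Average RSS / (N - p - 1) should be close to sigma2 = 1.
print(np.mean(rss) / (N - p - 1))
```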