Derivation of standard error of regression estimate with degrees of freedom


I am taking a course in econometrics.

I need help understanding how we arrive at the formula for the standard error of the regression, $$\hat{\sigma}^2=\frac{\sum{e_i^2}}{n-k}.$$

I understand Bessel's correction, which removes the bias inherent in the sample variance; a proof is available at Bessel's Correction Proof of Correctness.

I also found Standard deviation of error in simple linear regression

How to derive the standard error of linear regression coefficient

But I could not find the proof for the above expression (standard error of regression estimate).

I tried to expand the expression along the lines of the Bessel's correction proof.

$$\sum_{i=1}^n e_i^2=\text{Total SS}- \text{Explained SS}$$

Then I tried to expand the explained sum of squares term, but I got stuck at

$$ \sum _{i=1}^n \operatorname {E} \left((\beta\mathbf{ X}-\bar{y} )^2 \right) = \beta^2 E(x^2)-2\beta\bar{xy}+E(\bar{y}^2)$$

I don't know how to proceed. Can anyone please help?

Then I read this :

The term "standard error" is more often used in the context of a regression model, and you can find it as "the standard error of regression". It is the square root of the sum of squared residuals from the regression - divided sometimes by sample size n (and then it is the maximum likelihood estimator of the standard deviation of the error term), or by $n−k$ ($k$ being the number of regressors), and then it is the ordinary least squares (OLS) estimator of the standard deviation of the error term.

on Standard Error vs. Standard Deviation of Sample Mean
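As a quick numerical illustration of the two scalings described in that quote (a sketch with simulated data and numpy; the numbers here are hypothetical, not from any course):

```python
# Sketch: compare the two scalings of the sum of squared residuals,
# SSR/n (maximum-likelihood) versus SSR/(n-k) (unbiased OLS).
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3            # n observations, k columns in X (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 1.5              # true error standard deviation
y = X @ beta + rng.normal(scale=sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ beta_hat) ** 2)   # sum of squared residuals

sigma2_mle = ssr / n          # ML estimator of the error variance
sigma2_ols = ssr / (n - k)    # unbiased OLS estimator
print(sigma2_mle, sigma2_ols)
```

The OLS version is always slightly larger, and in repeated samples its average equals the true $\sigma^2$.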

Can anyone suggest a textbook where I can read about these derivations in more detail?

On BEST ANSWER

Here's one way. This will work only if you understand matrix algebra and the geometry of $n$-dimensional Euclidean space.

The model says $y_i = \alpha_0 + \sum_{\ell=1}^k \alpha_\ell x_{\ell i} + \varepsilon_i, \quad i=1,\ldots,n $ where

  • $y_i$ and $x_{\ell i}$ are observed;
  • The $\alpha$s are not observed and are to be estimated by least squares;
  • The $\alpha$s are not random, i.e. if a new sample with all new $x$s and $y$s is taken, the $\alpha$s will not change;
  • The $x$s are in effect treated as not random. This is justified by saying we're interested in the conditional distribution of the $y$s given the $x$s. The $y$s are random only because the $\varepsilon$s are;
  • The $\varepsilon$s are not observed. They have expected value $0$ and variance $\sigma^2$, and are uncorrelated. These assumptions are weaker than assuming normality and independence.

The $n\times(k+1)$ "design matrix" is $$ X= \begin{bmatrix} 1 & x_{11} & \cdots & x_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{bmatrix} $$ with linearly independent columns and typically $n\gg k$.

The $(k+1)\times 1$ vector of coefficients to be estimated is $$ \alpha= \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}. $$ The model can then be written as $Y= X\alpha+\varepsilon$, where $Y, \varepsilon \in\mathbb R^{n\times 1}$. Then $Y$ has expected value $X\alpha\in\mathbb R^{n\times 1}$ and variance $\sigma^2 I_n\in\mathbb R^{n\times n}$.

The "hat matrix" is $H = X(X^T X)^{-1} X^T$, an $n\times n$ matrix of rank $k+1$. The vector $\widehat Y = HY$ is the orthogonal projection of $Y$ onto the column space of $X$. It is also $\widehat Y=HY = X\widehat\alpha$, where $\widehat\alpha$ is the vector of least-squares estimates of the components of $\alpha$.

The residuals are $\widehat\varepsilon_i = e_i = Y_i-\widehat Y_i = Y_i-(\widehat\alpha_0 + \sum_{\ell=1}^k \widehat\alpha_\ell x_{\ell i})$. These are observable estimates of the unobservable errors. The vector of residuals is $$ \widehat\varepsilon = e = (I-H)Y. $$ This has expected value $(I-H)\operatorname{E}(Y) = (I-H)X\alpha = 0$.

We seek \begin{align} & \operatorname{E}(\|\widehat\varepsilon\|^2) = \operatorname{E}(\|e\|^2) \\[10pt] = {} & \operatorname{E} ( \Big((I-H)Y\Big)^T \Big((I-H)Y\Big)) \\[10pt] = {} & \operatorname{E} (Y^T (I-H) Y) \qquad \text{since } (I-H)^T = I-H = (I-H)^2. \text{ (Check that.)} \end{align} We've projected $Y$ onto the $(n-(k+1))$-dimensional column space of $I-H$. The expected value of the projection is $0$.

I claim the variance of the projection is just $\sigma^2$ times the identity operator on that $(n-(k+1))$-dimensional space. The reason for that is that $I-H$ is itself the identity operator on that $(n-(k+1))$-dimensional space, which is the orthogonal complement of the column space of $X$.

So it's as if we have a random vector $w$ in $(n-(k+1))$-dimensional space with expected value $0$ and variance $\sigma^2 I_{(n-(k+1))\times(n-(k+1))}$, and we're asking what $\operatorname{E}(\|w\|^2)$ is. And that is $\sigma^2(n-(k+1))$.
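Spelling out that last expectation coordinate by coordinate (a small filling-in step, using only the stated mean and variance of $w$):

$$\operatorname{E}(\|w\|^2)=\sum_{i=1}^{n-(k+1)}\operatorname{E}(w_i^2)=\sum_{i=1}^{n-(k+1)}\operatorname{Var}(w_i)=(n-(k+1))\,\sigma^2,$$

since each coordinate satisfies $\operatorname{E}(w_i)=0$, so $\operatorname{E}(w_i^2)=\operatorname{Var}(w_i)=\sigma^2$.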

Hence the expected value of the sum of squares of residuals (which is the "unexplained" sum of squares) is $\sigma^2(n-(k+1))$, so $\hat\sigma^2=\frac{\sum_i e_i^2}{n-(k+1)}$ is an unbiased estimator of $\sigma^2$. In the question's notation, where $k$ counts all estimated coefficients including the intercept, this is exactly $\frac{\sum e_i^2}{n-k}$.
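A Monte Carlo sanity check of this conclusion (a sketch with numpy; the design, coefficients, and $\sigma$ below are made up for illustration):

```python
# Sketch: average the sum of squared residuals over many simulated samples
# and compare with sigma^2 * (n - (k+1)).
import numpy as np

rng = np.random.default_rng(42)
n, k, sigma = 30, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # fixed design
alpha = np.array([1.0, -1.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix (X is fixed)

ssr_values = []
for _ in range(5000):
    Y = X @ alpha + rng.normal(scale=sigma, size=n)
    e = Y - H @ Y                           # residual vector (I - H) Y
    ssr_values.append(e @ e)                # sum of squared residuals

print(np.mean(ssr_values))   # approx. sigma^2 * (n - (k+1)) = 4 * 27 = 108
```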