What is the distribution of the residual in simple linear regression?


Suppose that $$Y_i=\beta_0+\beta_1x_i+\epsilon_i,$$ where $\epsilon_1,\ldots,\epsilon_n$ are independent random variables, each with the normal distribution $N(0,\sigma^2)$. Let $$\hat{Y}_i=B_0+B_1x_i$$ be the least squares fit to the data $\{(x_1,Y_1),\ldots,(x_n,Y_n)\}$. My question is: what is the distribution of the residual $$e_i=Y_i-\hat{Y}_i\,?$$ If it is not $N(0,\sigma^2)$, then in what sense are we justified in saying that the residual $e_i$ "mimics" the error $\epsilon_i$? Can we infer anything about $\epsilon_i$ from the residual plot of $e_i$? I know that $$\epsilon_i=Y_i-E[\hat{Y}_i],$$ but this is hardly a convincing reason to treat $e_i$ as if it "behaves like $\epsilon_i$."


Best answer

The $i$th residual is normally distributed, with mean zero and variance $$ \operatorname{Var}(e_i) = \sigma^2\left(1-\frac1n-\frac{(x_i-\bar x)^2}{\operatorname{SSX}}\right), $$ where SSX is shorthand for $\sum_k(x_k-\bar x)^2$. It is normal because of the formulas $$ e_i=Y_i-\hat Y_i=(\epsilon_i-\bar\epsilon)-(B_1-\beta_1)(x_i-\bar x) $$ and $$B_1-\beta_1=\frac{\sum_k(x_k-\bar x)(\epsilon_k-\bar \epsilon)}{\operatorname{SSX}}, $$ which together express $e_i$ as a linear combination of the independent normal random variables $\epsilon_1,\ldots,\epsilon_n$. The mean is zero by the same formulas, since each $\epsilon_k$ has mean zero. So $e_i$ is not $N(0,\sigma^2)$: its variance is strictly smaller than $\sigma^2$, and it depends on $i$ through the distance of $x_i$ from $\bar x$. An elementary derivation of the variance is found in this answer. A slicker but more advanced derivation is found in this answer.
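The variance formula can be checked numerically against the standard matrix identity $\operatorname{Cov}(e)=\sigma^2(I-H)$, where $H=X(X^TX)^{-1}X^T$ is the hat matrix of the design with an intercept column. The following is only a sketch: the function names and the sample $x$ values are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def residual_variance_factors(x):
    """Return 1 - 1/n - (x_i - xbar)^2 / SSX for each i (Var(e_i) / sigma^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    ssx = np.sum(centered ** 2)
    return 1.0 - 1.0 / n - centered ** 2 / ssx

def hat_matrix(x):
    """Hat matrix H = X (X^T X)^{-1} X^T for the design [1, x]."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    return X @ np.linalg.solve(X.T @ X, X.T)

# Illustrative design points (an assumption, not taken from the question).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])

factors = residual_variance_factors(x)
# Diagonal of I - H gives Var(e_i) / sigma^2 in matrix form.
matrix_factors = np.diag(np.eye(x.size) - hat_matrix(x))

print(factors)
print(np.allclose(factors, matrix_factors))
```

The factors add up to $n-2$, which matches the usual statement that the residual sum of squares has $n-2$ degrees of freedom.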

More can be said: It turns out that the correlation between the residual $e_i$ and the error $\epsilon_i$ is $$\sqrt{1-\frac1n-\frac{(x_i-\bar x)^2}{\operatorname{SSX}}}. $$ This tells us that as $n$ gets large, the correlation tends to $1$ (provided the term $(x_i-\bar x)^2/\operatorname{SSX}$ also tends to zero, as it does when SSX grows without bound), so that the residual is virtually identical to the error. This makes sense, because as the sample gets bigger, the estimators $B_0$ and $B_1$ converge to the true $\beta_0$ and $\beta_1$, hence the regression line $y=B_0+B_1 x$ converges to the theoretical line $y=\beta_0+\beta_1 x$.
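The correlation formula can also be checked by simulation. The sketch below draws many data sets from the model, fits the least squares line to each, and compares the empirical correlation between $e_i$ and $\epsilon_i$ with the square-root expression. The parameter values, the design points, and the function name are arbitrary choices for illustration, and the agreement is only up to Monte Carlo error.

```python
import numpy as np

def simulate_residual_error_correlation(x, beta0=1.0, beta1=2.0, sigma=1.5,
                                        reps=20000, seed=0):
    """Return (empirical, theoretical) correlations between e_i and eps_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    ssx = np.sum(centered ** 2)

    # Each row is one simulated data set of n errors and responses.
    eps = rng.normal(0.0, sigma, size=(reps, n))
    y = beta0 + beta1 * x + eps

    # Least squares slope and intercept for every row at once.
    b1 = (y - y.mean(axis=1, keepdims=True)) @ centered / ssx
    b0 = y.mean(axis=1) - b1 * x.mean()
    resid = y - (b0[:, None] + b1[:, None] * x)

    empirical = np.array(
        [np.corrcoef(resid[:, i], eps[:, i])[0, 1] for i in range(n)]
    )
    theoretical = np.sqrt(1.0 - 1.0 / n - centered ** 2 / ssx)
    return empirical, theoretical

# Illustrative design points (an assumption, not taken from the question).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
empirical, theoretical = simulate_residual_error_correlation(x)
print(np.round(empirical, 3))
print(np.round(theoretical, 3))
```

With only five points the correlations are well below $1$, especially at the extreme $x$ value; rerunning with a longer, well-spread `x` should push them all toward $1$.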