Errors in estimates of intercept and slope in least squares method


We want to find the line $y = mx + c$ that best "fits" the list of points $$(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots, (x_n, y_n).$$ For each point there is no uncertainty in $x_i$, and each $y_i$ has uncertainty $\sigma_i$.

By minimizing the sum of squared differences

$$\sum_i [y_i - (mx_i + c)]^2$$

with respect to $m$ and then with respect to $c$, and then solving the resulting system of equations, we can find $m$ and $c.$ This I understand.

My doubt is how can we find uncertainties for $m$ and $c?$

My textbook just says that we can sum the squared partial derivatives of $m$ with respect to each $y_i$, multiplied by $\sigma_i^2$ (the variance of the given $y_i$), and it performs the calculation in just one line.

Could anyone explain in more detail how to calculate such errors? I think seeing an example with 3 points would make me understand the concept without the trouble of too much notation.

Here is the calculation of my book that gives me trouble:

$$\operatorname{Var}[m] = \sum_i \left(\frac{\partial m}{\partial Y_i}\right)^2 \sigma_{Y_i}^2 = \left(\frac{x_i}{\sigma_{Y_i}^2} - \frac{\overline x}{\sigma_{Y_i}^2}\right)^2 \cdot \frac{1}{\operatorname{Var}[x]^2} \cdot \frac{1}{\sum_i \frac{1}{\sigma_{Y_i}^2}} = \frac{1}{\operatorname{Var}[x] \sum_i \frac{1}{\sigma_{Y_i}^2}}$$

At the start of the second line my book uses $x_i$ outside of any summation symbol; I think that is an error.
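Since a three-point example was requested: below is a minimal numeric sketch in Python (NumPy) with made-up data and uncertainties. It computes the weighted least-squares $m$ and $c$ (weights $w_i = 1/\sigma_i^2$), then evaluates $\operatorname{Var}[m]$ two ways: by the propagation sum $\sum_i (\partial m/\partial Y_i)^2 \sigma_i^2$ that the textbook uses, and by the standard closed form $S/\Delta$; the two agree, which is the point of the textbook's one-line calculation.

```python
import numpy as np

# Made-up 3-point data set, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sigma = np.array([0.1, 0.2, 0.1])    # uncertainty of each y_i

w = 1.0 / sigma**2                   # weights w_i = 1/sigma_i^2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
Delta = S * Sxx - Sx**2

m = (S * Sxy - Sx * Sy) / Delta      # weighted least-squares slope
c = (Sxx * Sy - Sx * Sxy) / Delta    # weighted least-squares intercept

# Error propagation: Var[m] = sum_i (dm/dy_i)^2 * sigma_i^2,
# where dm/dy_i = w_i (S x_i - Sx) / Delta.
dm_dy = w * (S * x - Sx) / Delta
var_m_prop = np.sum(dm_dy**2 * sigma**2)

# Closed form obtained by simplifying that sum: Var[m] = S / Delta.
var_m_closed = S / Delta

print(m, c, var_m_prop, var_m_closed)
```

Running this, the two variance computations give the same number, which is what the book's derivation asserts (its closed form is written with a weighted $\operatorname{Var}[x]$, but algebraically it is the same quantity as $S/\Delta$).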


Let $X$ denote the matrix with 1s in the first column and $x_i$ in the second column. Then the coefficient vector is $(c, m) = (X^TX)^{-1}X^Ty$. Let $r$ denote the second row of $(X^TX)^{-1}X^T$ (the row corresponding to the slope); then $$m=r^T y = \sum_i r_i y_i.$$ Assuming the uncertainties of the $y_i$ are uncorrelated, you get $$\sigma^2(m)=\sum_i r_i^2 \sigma^2(y_i).$$ You can do the same for $c$ using the first row of $(X^TX)^{-1}X^T$.
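As a concrete illustration of this answer, here is a Python (NumPy) sketch with made-up data. Note that the fit here is unweighted, matching the formula $(X^TX)^{-1}X^Ty$ above; the $\sigma_i$ enter only through the propagation step.

```python
import numpy as np

# Made-up 3-point data set for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sigma = np.array([0.1, 0.2, 0.1])

X = np.column_stack([np.ones_like(x), x])   # design matrix: columns [1, x_i]
H = np.linalg.inv(X.T @ X) @ X.T            # (X^T X)^{-1} X^T, shape (2, n)

c, m = H @ y                                # least-squares coefficients

# Each coefficient is a fixed linear combination of the y_i,
# so its variance follows from uncorrelated error propagation:
var_c = np.sum(H[0]**2 * sigma**2)
var_m = np.sum(H[1]**2 * sigma**2)
print(m, c, var_m, var_c)
```

The rows of $H = (X^TX)^{-1}X^T$ are exactly the coefficient vectors $r$ of the answer, so squaring them and summing against $\sigma^2(y_i)$ gives the stated variances.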


I could be wrong, but I doubt that a demonstration with $n = 3$ will do much for your intuition. In that case, by the time we estimate the slope and intercept, we do not have a good estimate of variability. Maybe it will help to show several regressions with $n = 20.$

Model. Suppose $x_i = i,$ for $i = 1, 2, \dots, 20.$ That is, the $x$'s are just the integers from $1$ through $20.$ If the true y-intercept is $\beta_0 = 4$ and the true slope is $\beta_1 = 1.5,$ then the regression model is $$Y_i = \beta_0 + \beta_1 x_i + e_i,$$ where the errors are independently $e_i \sim \mathsf{Norm}(\mu=0,\, \sigma=2).$

Then we can generate the $Y_i$ in R statistical software, according to this known model, as follows.

set.seed(622);  n = 20;  b0 = 4;  b1 = 1.5;  sg = 2;  x = 1:n
y = b0 + b1*x + rnorm(n, 0, sg)
plot(x, y, pch=19);  abline(a = b0, b=b1, col="green3")

The 20 points $(x_i, Y_i)$ are plotted below along with the known linear relationship $Y_i = 4 + 1.5x_i$ and the variability of the points shows the effect of the normal errors $e_i$.

[Figure: scatterplot of the 20 simulated points with the true line $y = 4 + 1.5x$ in green.]

Regression line. How accurately can the regression line (least squares line) recover the information about our model? In R, we find the estimated y-intercept $\hat\beta_0 = 3.519$ and the estimated slope $\hat\beta_1 = 1.551,$ as follows:

lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      3.519        1.551  

The figure below shows the same data points and true model (green) as above, along with the regression line (dashed red). The regression line is not a perfect copy of the true model, but it is close enough to be useful. Of course, a different experiment using the same model would have different random errors $e_i$, so another experiment would give a slightly different regression line.

[Figure: the same data with the true model line (green) and the fitted regression line (dashed red).]

Distribution of estimates of slope $\beta_1.$ Essentially, your question asks for the distribution of the estimates of the y-intercept and slope. For simplicity, we focus just on the slope. We repeat the regression procedure 100,000 times. Each time we use $n=20,\, \beta_0 = 4,\, \beta_1 = 1.5,$ and $\sigma=2.$ However, on each iteration we simulate new errors $e_i,$ in order to try to understand the variability in the distribution of $\hat\beta_1.$

 set.seed(622);  m = 10^5;  n = 20;  x = 1:n;  b0 = 4;  b1=1.5;  sg = 2
 b1.hat = replicate(m, lm(b0 + b1*x + rnorm(n,0,sg) ~ x)$coef[2])
 mean(b1.hat);  sd(b1.hat)
 ## 1.499939    # Expected(b1.hat) aprx 1.5
 ## 0.07764731  # SD(b1.hat)
 hist(b1.hat, prob=T, col="skyblue2", main="Distribution of Slope Estimates")

[Figure: histogram of the 100,000 simulated slope estimates $\hat\beta_1$.]

The simulation suggests that $E(\hat\beta_1) = \beta_1 = 1.5.$ According to statistical theory, $$SD(\hat\beta_1) = \sigma/\sqrt{(n-1)S_x^2} = 2/\sqrt{665} = 0.07756,$$ which is well-approximated (by 0.07765) in the simulation.
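The theoretical value quoted above can be checked directly by arithmetic (a quick numeric check, here in Python, independent of the R simulation):

```python
import numpy as np

n, sigma = 20, 2.0
x = np.arange(1, n + 1)
Sx2 = np.var(x, ddof=1)                    # sample variance of 1..20, which is 35
sd_slope = sigma / np.sqrt((n - 1) * Sx2)  # sigma / sqrt((n-1) * S_x^2)
print(round(sd_slope, 5))
```

This reproduces $2/\sqrt{19 \cdot 35} = 2/\sqrt{665} \approx 0.07756,$ the value the simulation approximated.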

The histogram seems to be approximately normal in shape. More precisely,

$$\frac{\hat\beta_1-\beta_1}{S_{Y|x}/\sqrt{(n-1)S_x^2}}\sim \mathsf{T}(\nu = n-2),$$

Student's t distribution with $n-2$ degrees of freedom, where $S_{Y|x}$ is an estimate of $\sigma = 2$ (using $x$'s and $Y$'s). So the exact distribution of $\hat \beta_1$ is based on a t distribution that is nearly normal. In practical applications, the t distribution can be used to make a 95% confidence interval for the unknown true slope $\beta_1.$