Std and confidence intervals for Linear Regression coefficients


When fitting a linear regression model in R, for example, the output includes all the coefficients $\beta_i$ along with other properties such as the standard error and a 95% confidence interval for each coefficient.

For me, linear regression is an optimization problem: we're trying to find $\beta$ that minimizes $$ \max_{i \,\in\, [|1,N|]} |y_i-\beta^Tx_i -\beta_0|, $$ so hopefully we find an optimal $\hat{\beta}$. With this construction, I don't understand how this would result in any standard deviation for any $\beta_i$.

I guess programming languages like R and Python do this differently? Say we have a sample of $N$ points $x_i$: do they actually perform the optimization on different subsets of the data, which would yield different optima, and then average them? I would really like to understand this aspect of linear regression because it's been bugging me for a while now.
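(For reference, the resampling scheme described here — refitting on random subsets and looking at the spread of the optima — is essentially the bootstrap, which is *not* what `lm` does by default. A minimal numpy sketch on simulated data, where all variable names and the numbers 2 and 3 are purely illustrative:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y = 2 + 3*x + noise
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# Refit least squares on resampled versions of the data and
# look at the spread of the fitted coefficients
betas = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                    # rows drawn with replacement
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    betas.append(b)
betas = np.asarray(betas)

print("mean over refits:", betas.mean(axis=0))     # near the true (2, 3)
print("spread over refits:", betas.std(axis=0))
```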

Thank you all.

Best answer:

You have observed values of $x_{1,i}, \ldots, x_{p,i}$ for $i=1,\dots,n$ and $n\gg p,$ and observed values of $y_i,$ and you want to estimate coefficients $\beta_0,\beta_1,\ldots,\beta_p$ for which

$$y_i = \beta_0+\beta_1 x_{1,i} + \cdots+ \beta_p x_{p,i} + \text{error}_i \text{ for } i = 1,\ldots, n. \tag 1 $$

The most usual method of estimation is that the estimates $\widehat \beta_0,\widehat\beta_1,\ldots,\widehat\beta_p$ are the values of $\beta_0,\beta_1,\ldots,\beta_p$ that minimize the sum of squares $$ \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i}))^2. $$ That's how R does it if you use the lm command.
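To see that this really is just an optimization problem, here is a numpy sketch that minimizes the sum of squares directly with plain gradient descent and compares the result to the linear-algebra solution. (The data and learning rate are illustrative, and `lm` itself uses a QR decomposition rather than iterative descent.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative): one predictor plus an intercept
n = 100
x = rng.normal(size=n)
y = 1.5 - 2.0 * x + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x])     # design matrix with intercept column

def sum_of_squares(beta):
    r = y - X @ beta
    return r @ r

# Treat it literally as an optimization problem: gradient descent
# on the sum of squares
beta = np.zeros(2)
lr = 1e-3
for _ in range(20_000):
    grad = -2 * X.T @ (y - X @ beta)     # gradient of the sum of squares
    beta -= lr * grad

# Same answer as the direct linear-algebra solution
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_ls, atol=1e-6))   # True
```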

When it's done that way, the mapping $$ \left[ \begin{array}{c} y_1 \\ \vdots \\ \vdots \\ \vdots \\ y_n \end{array} \right] \mapsto \left[ \begin{array}{c} \widehat\beta_0 \\ \widehat\beta_1 \\ \vdots \\ \widehat\beta_p \end{array} \right] \qquad (\text{with all $x_{i,j}$ fixed}) $$ is linear.

Call the vector to the left of the arrow $Y$ and the one to its right $\widehat\beta,$ and let $X$ be the $n\times(p+1)$ matrix $$ \left[ \begin{array}{rrcr} 1 & x_{1,1} & \cdots & x_{p,1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1,n} & \cdots & x_{p,n} \end{array} \right]. $$ Then you can show that $$ \widehat\beta = (X^T X)^{-1} X^T Y. $$

Often it is assumed that the "errors" in line $(1)$ above are independent random variables distributed as $\operatorname{Normal}(0,\sigma^2).$ Then one can deduce that \begin{align} \widehat\beta \sim {} & \operatorname{Normal}(\beta, \sigma^2(X^T X)^{-1}) \\ & \text{(a $(p+1)$-dimensional normal distribution} \tag 2 \\ & \phantom{(} \text{with a $(p+1)\times1$ expected value} \\ & \phantom{(} \text{and a $(p+1)\times(p+1)$ variance matrix),} \\[10pt] & \frac 1 {\sigma^2}\sum_{i=1}^n (y_i - (\widehat\beta_0 + \widehat\beta_1 x_{1,i} + \cdots + \widehat\beta_p x_{p,i}))^2 \sim \chi^2_{n-p-1}, \tag 3 \\[10pt] \text{and } & \text{the quantities in $(2)$ and $(3)$ above are independent.} \tag 4 \end{align}

By using $(2),$ $(3),$ and $(4)$ and the theory of the t-distribution, one can deduce confidence intervals for any $\beta_k$ for $k=0,\ldots,p.$ Concretely, the standard error that R reports for $\widehat\beta_k$ is $\widehat\sigma\sqrt{\left[(X^TX)^{-1}\right]_{kk}},$ where $\widehat\sigma^2$ is the residual sum of squares divided by $n-p-1,$ and the 95% interval is $\widehat\beta_k \pm t_{n-p-1,\,0.975}\,\widehat\sigma\sqrt{\left[(X^TX)^{-1}\right]_{kk}}.$
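The whole pipeline — closed-form $\widehat\beta,$ estimating $\sigma^2$ from the residuals, and the t-based intervals — can be sketched in a few lines of numpy (scipy is assumed available for the t quantile, and the data and `beta_true` values are illustrative):

```python
import numpy as np
from scipy import stats   # assumed available, for the t quantile

rng = np.random.default_rng(2)

# Simulated data (illustrative): two predictors plus an intercept
n, p = 100, 2
beta_true = np.array([1.0, 2.0, -0.5])                       # intercept, then slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1) design
y = X @ beta_true + rng.normal(0, 1.0, n)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate sigma^2 from the residuals; by (3) the residual sum of
# squares over sigma^2 is chi-squared with n - p - 1 degrees of freedom
resid = y - X @ beta_hat
df = n - p - 1
sigma2_hat = resid @ resid / df

# Standard errors come from the diagonal of sigma^2 (X^T X)^{-1} in (2)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# 95% confidence intervals via the t distribution
tcrit = stats.t.ppf(0.975, df)
lower, upper = beta_hat - tcrit * se, beta_hat + tcrit * se
```

These are the numbers R prints in the `lm` summary: no resampling or averaging over subsets happens anywhere — the uncertainty comes entirely from the assumed distribution of the errors.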