How is the regression coefficient derived?


I searched quite a bit in the forum but can't find this. On Wikipedia the derivation of the regression coefficient for simple linear regression is skipped; it just points to a book that isn't freely available.

Instead they just present the results, and I can't find any reasonably friendly derivation of the steps that go from $\beta_0$, $\beta_1$ to $r_{xy}$.

How is the regression coefficient derived?

I.e., starting with the first equality for $\beta$ below, how do you get to the last definition of $r_{xy}$?


[image: the Wikipedia formulas for $\beta$]

I can follow the steps from the bare linear formula up to the one just before $r_{xy}$.

And this is the definition:

[image: the definition of $r_{xy}$]

There is also a Wikipedia page with derivations, but it only covers the least-squares error and does not include the regression coefficient.

There are two answers below.

BEST ANSWER
  • The formula for $\beta$ (after the minimization) is

\begin{align} \beta &= \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} \\ &= \frac{\sum_i (x_i - \bar{x})\, (y_i - \bar{y})}{\sum_i (x_i-\bar{x})^2} \end{align}

  1. The Variance can be rewritten (bottom of the post for details)

$$ \mathrm{Var}(X) = n\,(\overline{x^2} - \bar{x}^2)$$

  1. The Covariance can be rewritten

\begin{align} \mathrm{Cov}(X,Y) = n\,(\overline{x\,y} - \bar{x}\bar{y}) \end{align}

  1. Substituting these, the factors of $n$ cancel and we have:

\begin{align} \beta = \frac{\overline{x\,y} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} \end{align}

  1. Finally, multiply and divide by $S_y = \sqrt{\overline{y^2} - \bar{y}^2}$, recalling that $S_x = \sqrt{\overline{x^2} - \bar{x}^2}$:

\begin{align} \beta = \left(\frac{\overline{x\,y} - \bar{x}\bar{y}}{\sqrt{(\overline{x^2} - \bar{x}^2)\,(\overline{y^2} - \bar{y}^2)}}\right)\, \frac{S_y}{S_x} \end{align}

  • The term within parentheses is exactly $r_{xy}$.
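As a numerical sanity check of the final identity $\beta = r_{xy}\,S_y/S_x$, here is a short sketch (the data points are made up purely for illustration):

```python
import numpy as np

# Illustrative sample (not from the post).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Direct least-squares slope: Cov(X, Y) / Var(X) as sums of deviations.
beta_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Via the correlation coefficient: r_xy * S_y / S_x.
# np.std with its default ddof=0 is exactly sqrt(mean(x^2) - mean(x)^2),
# matching the S_x, S_y used in the derivation above.
r_xy = np.corrcoef(x, y)[0, 1]
beta_via_r = r_xy * y.std() / x.std()

assert np.isclose(beta_direct, beta_via_r)
```

Both routes give the same slope, since the factors of $n$ (and the choice of $n$ vs. $n-1$) cancel in the ratio.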

Some minor details

Formula for the Variance

\begin{align} \sum_i (x_i-\bar{x})^2 &= \sum_i \left(x_i^2 - 2\,x_i\,\bar{x}\right) + n\bar{x}^2 \\ &= n\,(\overline{x^2} - 2 \bar{x}^2) + n\, \bar{x}^2 \\ &= n\,(\overline{x^2} - \bar{x}^2) \end{align}

Formula for the Covariance

\begin{align} \mathrm{Cov}(X,Y) &= \sum_i (x_i - \bar{x})\, (y_i - \bar{y}) \\ &= n\,\overline{x\,y} + n\,\bar{x}\bar{y} - \sum_i \left(x_i\bar{y} + y_i\bar{x}\right) \\ &= n\,\overline{x\,y} + n\,\bar{x}\bar{y} - 2n\,\bar{x}\bar{y} \\ &= n\,(\overline{x\,y} - \bar{x}\bar{y}) \end{align}
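Both shortcut identities are easy to check numerically; here is a small sketch on randomly generated data (any sample works):

```python
import numpy as np

# Verify the variance and covariance shortcut formulas on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
n = len(x)

# Sum of squared deviations vs. n * (mean(x^2) - mean(x)^2).
lhs_var = np.sum((x - x.mean()) ** 2)
rhs_var = n * (np.mean(x ** 2) - x.mean() ** 2)

# Sum of cross-deviations vs. n * (mean(xy) - mean(x) * mean(y)).
lhs_cov = np.sum((x - x.mean()) * (y - y.mean()))
rhs_cov = n * (np.mean(x * y) - x.mean() * y.mean())

assert np.isclose(lhs_var, rhs_var)
assert np.isclose(lhs_cov, rhs_cov)
```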

SECOND ANSWER

So, let us try to show how both $\widehat\beta_0$ and $\widehat\beta_1$ are derived. Let us assume the following form for the original relation: $$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$ Here the true regression model parameters $\beta_{0,1}$ are unobserved, so that our task is to estimate them, building a proper estimated relation between $x_i$ and $y_i$: $$ \widehat{y_i} = \widehat{\beta}_0 + \widehat{\beta}_1x_i \\ y_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i + r_i $$ where $r_i$ is the $i^{\text{th}}$ residual (the difference between the sample observed $y_i$ and the estimated one, $\widehat{y}_i$) that builds up the estimated $y_i$, so that $\widehat{y}_i=y_i - r_i$.

Recall that the regression we are building here is the “least squares” regression, so essentially our task is to minimize the sum of squared “errors”, a.k.a. residuals.

Sum of squared errors (SSE) can thus be defined as follows: $$ SSE = \sum\limits_{i=1}^n \left(y_i - \widehat{y}_i\right)^2 = (\text{expanding the estimated } y_i) = \sum\limits_{i=1}^n \left(y_i - \widehat{\beta}_0 - \widehat{\beta}_1x_i\right)^2 $$

Now, as we can control only $\widehat{\beta}_0$ and $\widehat{\beta}_1$, let us minimize the derived $SSE$ w.r.t. each of the controllable parameters. That means we must find $\widehat{\beta}^*_0$ and $\widehat{\beta}^*_1$ such that $\left.\frac{\partial SSE}{\partial \widehat{\beta}_0}\right|_{\widehat{\beta}^*_0}=0$ and $\left.\frac{\partial SSE}{\partial \widehat{\beta}_1}\right|_{\widehat{\beta}^*_1}=0$: $$ \frac{\partial SSE}{\partial \widehat{\beta}_0} = \frac{\partial}{\partial{\widehat{\beta}_0}}\left(\sum\limits_{i=1}^n \left(y_i - \widehat{\beta}_0 - \widehat{\beta}_1x_i\right)^2\right) = \frac{\partial}{\partial{\widehat{\beta}_0}}\left(\sum\limits_{i=1}^n \left(\widehat{\beta}_0^2- 2\widehat{\beta}_0y_i+2\widehat{\beta}_0\widehat{\beta}_1x_i+\dots\right)\right)=2n\widehat{\beta}_0-2\sum\limits_{i=1}^n y_i +2\widehat{\beta}_1 \sum\limits_{i=1}^n x_i;\\ \left.\frac{\partial SSE}{\partial \widehat{\beta}_0}\right|_{\widehat{\beta}^*_0}=0 \Longleftrightarrow 2n\widehat{\beta}^*_0-2\sum\limits_{i=1}^n y_i +2\widehat{\beta}_1 \sum\limits_{i=1}^n x_i =0; \\ 2n \widehat{\beta}^*_0 = 2\sum\limits_{i=1}^n y_i -2\widehat{\beta}_1 \sum\limits_{i=1}^n x_i \Longleftrightarrow \widehat{\beta}^*_0 = \bar y - \widehat{\beta}_1 \bar x. $$

The same can be done for $\widehat{\beta}_1$ to find the optimal $\widehat{\beta}^*_1$ that minimizes the SSE. You can try deriving it yourself; it takes a while, but in the end you will get: $$ \widehat{\beta}^*_1 =\frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n(x_i-\bar x)^2}. $$
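A quick sketch verifying both closed-form estimates against NumPy's least-squares fit, on illustrative synthetic data (the true slope and intercept below are arbitrary choices):

```python
import numpy as np

# Noisy linear sample: y = 2.0 + 0.5 x + noise (values chosen for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Closed-form OLS estimates from the derivation above.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# np.polyfit(x, y, 1) returns [slope, intercept] for a degree-1 fit.
slope, intercept = np.polyfit(x, y, 1)

assert np.isclose(beta1, slope)
assert np.isclose(beta0, intercept)
```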

Getting from there to $r_{x,y}$ in the given form is just a matter of noting that the covariance between two variables can be expressed as: $$ \mathbb{C}\mathrm{ov}(x,y)=\mathbb{E}[xy]- \mathbb{E}[x]\mathbb{E}[y]. $$

Just note that in the case of our regression we compute the sample covariance ($s_{x,y}$), not the population covariance $\left(\mathbb{C}\mathrm{ov}(x,y)\right)$, so instead of $\mathbb{E}[x]$ and $\mathbb{E}[y]$ we work with the sample means $\bar x$ and $\bar y$.
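As a side note on conventions: `np.cov` divides by $n-1$ by default, while plugging sample means into the identity above divides by $n$, so the two differ by a factor of $n/(n-1)$ (illustrative random data; the distinction cancels in the slope formula anyway):

```python
import numpy as np

# Illustrative data to compare the two covariance conventions.
rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = rng.normal(size=20)
n = len(x)

# Identity with sample means plugged in: divides by n.
pop_cov = np.mean(x * y) - x.mean() * y.mean()

# numpy's sample covariance: divides by n - 1 by default.
sample_cov = np.cov(x, y)[0, 1]

assert np.isclose(sample_cov, pop_cov * n / (n - 1))
```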

Hope this answer was helpful.