Estimating coefficient in linear regression

Given: $y = b_0 + b_1x$

I am wondering what the explanation is behind this formula for estimating the $b_1$ coefficient:

$$ b_1 = \frac{\sum_{i=1}^n( x_i-\bar{x})(y_i-\bar{y})}{ \sum_{i=1}^n( x_i-\bar{x})^2 } $$

What are the steps to derive this formula?

Part 1 Update - March 18, 2021:

When I tried to substitute $\bar{y} - b_1\bar{x}$ for $b_0$ in

$$ b_0 \bar{x} + b_1 \overline{x^2} = \overline{xy} $$

I got stuck with $b_1$ on both sides of the equation:

$$ b_1 \overline{x^2} = \overline{xy}-(\overline{x} \bar{y} - b_1\overline{x^2}) $$

Can you please guide me through the remaining derivation steps? Thanks.

Part 2 Update

With further help from @MartinVesely, I realized that the substitution should be:

$$b_0 \bar{x} + b_1 \overline{x^2} = \overline{xy}$$

$$ ((\bar{y} - b_1\bar{x})\bar{x}) + b_1 \overline{x^2} = \overline{xy}$$

$$(\bar{x}\bar{y} - b_1(\bar{x})^2) + b_1 \overline{x^2} = \overline{xy}$$

$$( - b_1(\bar{x})^2) + b_1 \overline{x^2} = \overline{xy} - \bar{x}\bar{y} $$

$$b_1( -(\bar{x})^2 + \overline{x^2}) = \overline{xy} - \bar{x}\bar{y} $$

$$b_1 = \frac{\overline{xy} - \bar{x}\bar{y} }{ \overline{x^2} -(\bar{x})^2} $$

Best answer

The formula is derived with the least squares method.

First, write down the function $L = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$. This is the sum of squared differences between the observed outputs $y_i$ and the outputs predicted by the regression line.

Our goal is to minimize the difference between the actual data and the regression line, that is, to minimize $L$. To do so, we compute the first partial derivatives with respect to $b_0$ and $b_1$:

$$ \frac{\partial L}{\partial b_0} = -\sum_{i=1}^n 2(y_i - b_0 - b_1 x_i) $$

$$ \frac{\partial L}{\partial b_1} = -\sum_{i=1}^n 2x_i(y_i - b_0 - b_1 x_i) $$
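As a sanity check on the two partial derivatives above, here is a small Python sketch that compares them with central finite differences. The data points, the evaluation point `(0.5, 1.5)`, and the function names are made up for illustration and are not part of the original question.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def loss(b0, b1):
    """Sum of squared residuals L(b0, b1)."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


def grad(b0, b1):
    """Analytic partial derivatives of L with respect to b0 and b1."""
    d_b0 = -sum(2 * (y - b0 - b1 * x) for x, y in zip(xs, ys))
    d_b1 = -sum(2 * x * (y - b0 - b1 * x) for x, y in zip(xs, ys))
    return d_b0, d_b1


def numeric_grad(b0, b1, h=1e-6):
    """Central finite differences, used only as an independent check."""
    d_b0 = (loss(b0 + h, b1) - loss(b0 - h, b1)) / (2 * h)
    d_b1 = (loss(b0, b1 + h) - loss(b0, b1 - h)) / (2 * h)
    return d_b0, d_b1


# Evaluate both at an arbitrary point; the two pairs should agree closely.
analytic = grad(0.5, 1.5)
numeric = numeric_grad(0.5, 1.5)
print(analytic, numeric)
```

Because $L$ is quadratic in $b_0$ and $b_1$, the central difference should match the analytic derivative up to floating-point rounding.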

Now, by setting $\frac{\partial L}{\partial b_0}$ and $\frac{\partial L}{\partial b_1}$ equal to zero and dividing by -2 we have

$$ \sum_{i=1}^n (y_i - b_0 - b_1 x_i) = 0 $$

$$ \sum_{i=1}^n x_i(y_i - b_0 - b_1 x_i) = 0 $$

Expanding the sums leads to $$ \sum_{i=1}^n (y_i - b_0 - b_1 x_i) = \sum_{i=1}^n y_i - b_1\sum_{i=1}^n x_i - nb_0 = 0 $$

$$ \sum_{i=1}^n x_i(y_i - b_0 - b_1 x_i) = \sum_{i=1}^n x_iy_i - b_1\sum_{i=1}^n x_i^2 -b_0\sum_{i=1}^n x_i = 0 $$

Now, if we divide both equations by $n$ and rearrange them, we have $$ b_0 + b_1 \bar{x} = \bar{y} $$

$$ b_0 \bar{x} + b_1 \overline{x^2} = \overline{xy}, $$

where $\bar{x}$ is the average of the $x_i$ values (similarly $\bar{y}$ for the $y_i$), $\overline{x^2}$ is the average of the squares $x_i^2$, and $\overline{xy}$ is the average of the products $x_iy_i$.
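The two equations above form a 2×2 linear system in $b_0$ and $b_1$. A minimal Python sketch of solving it with Cramer's rule is shown here; the sample data and variable names are assumptions made for illustration only.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n                               # average of x_i
y_bar = sum(ys) / n                               # average of y_i
x2_bar = sum(x * x for x in xs) / n               # average of x_i^2
xy_bar = sum(x * y for x, y in zip(xs, ys)) / n   # average of x_i * y_i

# System to solve:
#   b0         + b1 * x_bar  = y_bar
#   b0 * x_bar + b1 * x2_bar = xy_bar
# Determinant of the coefficient matrix [[1, x_bar], [x_bar, x2_bar]].
det = x2_bar - x_bar ** 2

# Cramer's rule for each unknown.
b0 = (y_bar * x2_bar - x_bar * xy_bar) / det
b1 = (xy_bar - x_bar * y_bar) / det

print(b0, b1)
```

The determinant is the variance of the $x_i$, so the system has a unique solution whenever the $x_i$ are not all equal.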

The first equation gives $b_0 = \bar{y} - b_1\bar{x}$. Substituting this into the second equation and solving for $b_1$, we get $$ b_1 = \frac{\overline{xy} -\bar{x}\bar{y}}{\overline{x^2}-(\bar{x})^2}. $$

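The numerator $\overline{xy} -\bar{x}\bar{y}$ is the covariance of $x$ and $y$, and the denominator $\overline{x^2}-(\bar{x})^2$ is the variance of $x$. Expanding the centered sums shows that these are the same quantities that appear in your formula:

$$ \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \overline{xy} - \bar{x}\bar{y} - \bar{x}\bar{y} + \bar{x}\bar{y} = \overline{xy} - \bar{x}\bar{y} $$

$$ \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \overline{x^2} - 2(\bar{x})^2 + (\bar{x})^2 = \overline{x^2} - (\bar{x})^2 $$

The factor $\frac{1}{n}$ cancels in the ratio, which gives

$$ b_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}. $$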
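To see numerically that the two forms of $b_1$ agree, the following Python sketch computes the slope both ways. The function names and the sample data are hypothetical and serve only as an illustration.

```python
def slope_from_means(xs, ys):
    """b1 = (mean(xy) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    x2_bar = sum(x * x for x in xs) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    return (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)


def slope_from_centered_sums(xs, ys):
    """b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b1_means = slope_from_means(xs, ys)
b1_centered = slope_from_centered_sums(xs, ys)
print(b1_means, b1_centered)
```

Both functions should return the same slope up to floating-point rounding. In practice the centered form tends to be the safer choice numerically when the means are large compared with the spread of the data.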