Why isn't the linear regression coefficient just the average vector to the data points?


I am having trouble intuitively understanding the correctness of the formula to compute the coefficient for the regression line in a linear regression.

I know the formula is

$$\frac{\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^N(x_i - \bar{x})^2}$$

I have at some point gone through the proof and mechanically understood it. But intuitively I still don't see why the above formula computes the correct coefficient. In fact, intuitively I would have said that the coefficient for the regression line should be the average ratio of $y_i$ to $x_i$, where $(x_i, y_i)$ are the data points.

I wrote a small Jupyter notebook to illustrate this. I found that my naive approach is not completely wrong and in fact converges towards the correct value with more data, if the data is scattered over a fixed interval.

So... what are the critical points that my naive approach gets wrong, and what is the intuitive explanation for why the correct formula works better?
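A minimal sketch of the kind of notebook experiment described (the slope `beta`, the noise level, and the x-range are made-up assumptions; x is kept away from 0 so the ratios are well defined):

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    # Standard least-squares slope: sum((x-xbar)(y-ybar)) / sum((x-xbar)^2)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

def naive_slope(x, y):
    # Naive proposal: the plain average of the per-point ratios y_i / x_i
    return np.mean(y / x)

beta = 2.0
for n in (10, 100, 10_000):
    x = rng.uniform(1.0, 5.0, n)                 # x bounded away from 0
    y = beta * x + rng.normal(0.0, 1.0, n)       # linear signal plus noise
    print(n, ols_slope(x, y), naive_slope(x, y))
```

With this setup both estimates approach `beta` as `n` grows, which matches the observation in the question; the answers below explain why the least-squares version is still preferable.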

There are 3 answers below.

BEST ANSWER

Continuing with your simplifying assumptions, let's further assume that $\bar x=0$ and $\bar y=0$, so the standard solution is

$$ \frac{\sum_{i=1}^N x_iy_i}{\sum_{i=1}^Nx_i^2}\;. $$

We can write this as

$$ \frac{\sum_{i=1}^N x_i^2\frac{y_i}{x_i}}{\sum_{i=1}^Nx_i^2}\;. $$

So it's actually a weighted average of the ratios $\frac{y_i}{x_i}$, with weights $x_i^2$, not as different from your proposed solution as you perhaps thought it was.

The question remains why the weights $x_i^2$ in the standard solution are better than the equal weights that you propose to use. This is because, under the standard assumption that the $y_i$ all have the same additive error, the errors of values near the origin get amplified when you take the ratio $\frac{y_i}{x_i}$ with small values of $x_i$. It's intuitively clear that when you shift a data point near the origin by a certain vertical error, that changes the ratio more than if you do it with a data point further away; so the ratios for small $x_i$ are more uncertain and should carry less weight.

In fact, this can be stated more quantitatively. If you perform a linear regression with different error bars for the different data points, you find that each data point should be weighted with the inverse of its variance, that is, the inverse square of its standard deviation. Forming the ratio $\frac{y_i}{x_i}$ amplifies the error in $y_i$ by a factor $\frac1{x_i}$, so if we assume that the errors in the $y_i$ are all the same, the errors in the ratios are proportional to $\frac1{x_i}$, so the weights should be proportional to the inverse squares of those errors, that is, to $x_i^2$. So the standard formula is in fact just your formula, properly weighted.
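The weighted-average identity above is easy to verify numerically (a small sketch with simulated data; the true slope 3.0 and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, 500)
y = 3.0 * x + rng.normal(0.0, 1.0, 500)

# Center the data so that xbar = ybar = 0, matching the simplification above
xc, yc = x - x.mean(), y - y.mean()

# Standard centered least-squares slope
ols = np.sum(xc * yc) / np.sum(xc ** 2)

# The same quantity written as a weighted average of the ratios yc_i/xc_i,
# with weights xc_i^2
weighted = np.sum(xc ** 2 * (yc / xc)) / np.sum(xc ** 2)

print(ols, weighted)  # agree up to floating-point rounding
```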

ANSWER

Correlation is symmetrical: The correlation between $X$ and $Y$ is the same as the correlation between $Y$ and $X.$

Regression is not symmetrical. To take simple linear regression as an example, the regression line of $Y$ on $x$ can be viewed as the best way to model (perhaps eventually predict) values of $Y$ for given values of $x$ in the dataset. (Or in the case of prediction, for new values of $x$ not in the dataset used to compute the regression line.) The regression model is $Y_i = \beta_0 + \beta_1 x_i + e_i,$ where $e_i$ are independently distributed $\mathsf{Norm}(0, \sigma).$

The derivation you looked at involved finding y-intercept $\hat \beta_0$ and slope $\hat \beta_1$ that minimize $\sum_{i=1}^n (Y_i - \hat Y_i)^2,$ where $\hat Y_i = \hat \beta_0 + \hat\beta_1 x_i.$ (The regression line is often called the 'least-squares' line.)
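One can check that the closed-form slope and intercept really do minimize the sum of squared residuals by comparing them against a generic numerical least-squares fit, e.g. `np.polyfit` (a sketch with made-up simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 100)
y = 1.5 * x + 0.5 + rng.normal(0.0, 0.3, 100)

# Closed-form least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# np.polyfit minimizes the same sum of squared residuals;
# for deg=1 it returns (slope, intercept)
b1_fit, b0_fit = np.polyfit(x, y, deg=1)

print(b1, b1_fit)
print(b0, b0_fit)
```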

If you reverse the roles of $Y_i$ and $x_i$ (ascribing errors to the $X$'s instead of $y$'s) to find the regression of $X$ on $y,$ you will typically get a different regression line. The regression model would be $X_i = \beta_0^\prime + \beta_1^\prime y_i + e_i^\prime,$ where $e_i^\prime$ are independently distributed $\mathsf{Norm}(0, \sigma^\prime).$ [Primes (${}^\prime$) indicate alternative constants, not differentiation.]

In terms of units: For a slightly different perspective, consider modeling weights of collegiate swimmers $(Y_i)$ in kg in terms of their heights $(x_i)$ in cm. Then units of $\beta_0$ would be kg, and units of $\beta_1$ would be kg/cm. One can show that $\hat \beta_1 = rS_y/S_x,$ where the sample correlation $r$ has no units, the units of the sample standard deviation $S_y$ are kg, and the units of the sample standard deviation $S_x$ are cm.

By contrast, if you were modeling heights in terms of weights, then the units of $\hat \beta_1^\prime$ would be cm/kg. But $\hat\beta_1^\prime \ne 1/\hat\beta_1,$ unless $r = 1,$ so that the data fit a straight line precisely.
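The asymmetry is easy to see in code. A useful identity here is $\hat\beta_1 \hat\beta_1^\prime = r^2$ (it follows from $\hat\beta_1 = rS_y/S_x$ and $\hat\beta_1^\prime = rS_x/S_y$), so the two slopes are reciprocals only when $|r| = 1$. A sketch with made-up height/weight-style data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(170.0, 8.0, 200)                            # "heights" in cm
y = 0.9 * (x - 170.0) + 65.0 + rng.normal(0.0, 3.0, 200)   # "weights" in kg

def slope(a, b):
    # Least-squares slope of b regressed on a
    return np.sum((a - a.mean()) * (b - b.mean())) / np.sum((a - a.mean()) ** 2)

b1 = slope(x, y)     # y on x: units kg/cm
b1p = slope(y, x)    # x on y: units cm/kg
r = np.corrcoef(x, y)[0, 1]

print(b1 * b1p, r ** 2)   # these agree
print(b1p, 1 / b1)        # these do not, since |r| < 1
```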

ANSWER

This is more of a comment than an answer, but it may still be illustrative. What you observed in your simulation is the fact that your estimator is unbiased and consistent. Namely, for a model $y_i = \beta x_i + \epsilon_i$, where $\mathbb{E}[\epsilon_i \mid X]=0$ with finite variance, the estimator $$ \frac{1}{n}\sum_{i=1}^n\frac{y_i}{x_i} $$ is a legitimate estimator of $\beta$. Indeed, note that $$ \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\frac{y_i}{x_i}\,\middle|\,X\right]=\frac{1}{n}\sum_{i=1}^n\frac{\beta x_i}{x_i} = \beta \frac{n}{n} = \beta, $$ which means, intuitively, that the center of mass of the estimated line lies on the actual line. And, by the weak law of large numbers, $$ \frac{1}{n}\sum_{i=1}^n\frac{y_i}{x_i} \xrightarrow{p}\mathbb{E}\left[\frac{Y}{X}\right] = \beta $$ as $n \to \infty$. This is what you observed: increasing the number of observations brought the estimated line closer to the real line. So why use the "unintuitive" OLS estimator? The answer is already in the previous posts: although your estimator is a legitimate estimator, it is not an optimal one. When the optimality criterion is the squared error, $$ \sum_{i=1}^n\frac{x_i}{\sum_{j=1}^n x_j^2}\,y_i $$ is the best linear unbiased estimator of $\beta$.
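That trade-off can be checked with a small Monte Carlo experiment (a sketch; `beta`, the noise level, the x-range, and the repetition counts are all arbitrary choices): both estimators are roughly unbiased, but the OLS estimator has visibly smaller variance.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, n, reps = 2.0, 50, 2000

naive, ols = [], []
for _ in range(reps):
    x = rng.uniform(0.5, 3.0, n)
    y = beta * x + rng.normal(0.0, 1.0, n)
    naive.append(np.mean(y / x))                 # average-of-ratios estimator
    ols.append(np.sum(x * y) / np.sum(x ** 2))   # OLS through the origin

naive, ols = np.array(naive), np.array(ols)
print("means:    ", naive.mean(), ols.mean())    # both close to beta
print("variances:", naive.var(), ols.var())      # OLS variance is smaller
```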