Confusion about the relationship between regression line slope and covariance


In the simple linear regression model for a pair of random variables $(X,Y)$, the slope estimate $\hat\beta_1$ is given by

$$ \hat\beta_1 = \dfrac{\sum_{i=1}^N(x_i-\overline{x})(y_i - \overline{y})}{\sum_{i=1}^N(x_i - \overline{x})^2} \tag{1} $$

Many textbooks then quickly interpret this in terms of covariance and variance as

$$ \hat\beta_1 = \dfrac{Cov(x,y)}{Var(x)} \tag{2} $$

Question:
But wouldn't this be true only if we assume a uniform distribution for both the joint pmf in the covariance and the marginal pmf in the variance? That is, it is like assuming the form below and cancelling out the $\dfrac{1}{N}$ factors:

$$ \hat\beta_1 = \dfrac{\dfrac{1}{N}\sum_{i=1}^N(x_i-\overline{x})(y_i - \overline{y})}{\dfrac{1}{N}\sum_{i=1}^N(x_i - \overline{x})^2} \tag{3} $$

If the pmfs are not uniform, then

$$ \dfrac{Cov(x,y)}{Var(x)} = \dfrac{\sum\limits_{x}\sum\limits_{y}(x-\overline{x})(y - \overline{y})p(x,y)}{\sum\limits_{x}(x - \overline{x})^2p(x)} \tag{4} $$

which is not the same as (1), so (2) can't be true, right?
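As a quick numerical sanity check of the cancellation in $(3)$ (the data below are arbitrary, made-up values): as long as the covariance and the variance use the *same* normalization, whether $1/N$ or $1/(N-1)$, the factor cancels in the ratio and the result equals $(1)$ exactly.

```python
import numpy as np

# Made-up paired data, just for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# Slope via equation (1): raw sums, no 1/N factor.
slope_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Slope via sample covariance / sample variance; the common
# normalization factor (here 1/(N-1)) cancels in the ratio.
slope_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

assert np.isclose(slope_sums, slope_cov)
```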

There are 2 answers below.

Answer 1:

The OP's Eq. $(1)$ is the slope of the regression line when we have $N$ pairs $(x_i, y_i)$ of real numbers and ask what the "best" straight line fitting these $N$ data points is. In general, it is *not* the slope of the regression line when we have a pair of random variables $(X, Y)$ and ask for the random variable $\hat{Y} = \alpha + \beta X$ such that $E[(Y-\hat{Y})^2]$ is as small as possible. The answer to the latter question is indeed that $\beta$ must have the value $\frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}$, as the OP states in $(2)$, but this result applies to all random variables with finite variances, not just discrete random variables.

Indeed, if $X$ and $Y$ are discrete random variables taking on values $x_1, x_2, \ldots, x_M$ and $y_1, y_2, \ldots, y_N$ respectively, then the covariance $\operatorname{cov}(X,Y)$ is given by \begin{align}\operatorname{cov}(X,Y) &= \sum_{m=1}^M \sum_{n=1}^N P(X=x_m, Y = y_n)(x_m-\bar{x})(y_n-\bar{y})\\&= \sum_{m=1}^M \sum_{n=1}^N p_{X,Y}(x_m, y_n)(x_m-\bar{x})(y_n-\bar{y})\end{align} where $\bar{x}$ and $\bar{y}$ are the means $E[X]$ and $E[Y]$ respectively, and $p_{X,Y}(x_m, y_n)$ is the joint probability mass function (joint pmf) of $(X,Y)$. This is a slightly more general version of the numerator of $(4)$ in the OP's question. As the OP correctly asserts, if $M=N$ and the joint pmf has value $\frac{1}{N}$ at exactly the $N$ points $(x_i,y_i)$, then $\operatorname{cov}(X,Y)$ is indeed (proportional to) the numerator of $(1)$ in the OP's question.
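A small numerical sketch of the population-side claim above (the 3×3 joint pmf here is made up for illustration): for a discrete joint pmf, $\beta = \operatorname{cov}(X,Y)/\operatorname{var}(X)$, together with $\alpha = E[Y] - \beta E[X]$, minimizes $E[(Y - \alpha - \beta X)^2]$.

```python
import numpy as np

# Made-up, deliberately non-uniform joint pmf on a 3x3 grid of (x, y) values.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 1.0, 3.0])
p = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.30, 0.10],
              [0.05, 0.10, 0.20]])  # p[i, j] = P(X = xs[i], Y = ys[j])
assert np.isclose(p.sum(), 1.0)

X, Y = np.meshgrid(xs, ys, indexing="ij")
mx, my = (X * p).sum(), (Y * p).sum()          # E[X], E[Y]
cov = ((X - mx) * (Y - my) * p).sum()          # cov(X, Y)
var = ((X - mx) ** 2 * p).sum()                # var(X)
beta = cov / var
alpha = my - beta * mx

# Mean squared error E[(Y - a - b X)^2] for a candidate line a + b x.
def mse(a, b):
    return (((Y - a - b * X) ** 2) * p).sum()

# The (alpha, beta) above should beat nearby slopes, each paired
# with its own best intercept.
for b in beta + np.linspace(-0.5, 0.5, 11):
    a = my - b * mx
    assert mse(alpha, beta) <= mse(a, b) + 1e-12
```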

Answer 2 (from the OP):

I think my confusion stems from failing to distinguish the sample correlation coefficient from the population correlation coefficient. So I will try to summarize my improved understanding here, instead of in individual comments, and ask readers to correct me.

In the case of the sample correlation coefficient:
Suppose we have a sample of $N$ pairs $(x_i, y_i)$. Then the sample correlation coefficient is given by

$$ r = \dfrac{\sum_i(x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i(x_i - \overline{x})^2 \sum_i(y_i - \overline{y})^2}} \tag{1} $$

Here $\mathrm{cov}(X,Y)$ denotes the unbiased sample covariance and $(s_X, s_Y)$ the corresponding sample standard deviations. For a given sample (also as per MLE), each observation is implicitly weighted equally, i.e. uniformly. That is,

$$ \mathrm{cov}(X,Y) = \dfrac{1}{N-1}\sum_i(x_i - \overline{x})(y_i - \overline{y}) \tag{2} $$

$$ s_X^2 = \dfrac{1}{N-1}\sum_i(x_i - \overline{x})^2, \qquad s_Y^2 = \dfrac{1}{N-1}\sum_i(y_i - \overline{y})^2 \tag{3} $$

Substituting equations (2) and (3) into equation (1), the $\frac{1}{N-1}$ factors cancel and we get

$$ r = \dfrac{\sum_i(x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i(x_i - \overline{x})^2 \sum_i(y_i - \overline{y})^2}} = \dfrac{\mathrm{cov}(X,Y)}{s_X s_Y} \tag{4} $$

Similarly, for the simple regression line slope,

$$ \beta_1 = \dfrac{\sum_i(x_i - \overline{x})(y_i - \overline{y})}{\sum_i (x_i - \overline{x})^2} = \dfrac{\mathrm{cov}(X,Y)}{s_X^2} \tag{5} $$
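These sample identities are easy to sanity-check numerically (the sample below is made up; `ddof=1` selects the unbiased $1/(N-1)$ normalization in NumPy):

```python
import numpy as np

# Made-up sample; ddof=1 gives the unbiased 1/(N-1) normalization.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(scale=0.3, size=30)

cov = np.cov(x, y, ddof=1)[0, 1]               # equation (2)
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)  # square roots of equation (3)

# Sample correlation, equation (4), two ways.
r_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
assert np.isclose(r_direct, cov / (sx * sy))
assert np.isclose(r_direct, np.corrcoef(x, y)[0, 1])

# Regression slope, equation (5), two ways.
b_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
assert np.isclose(b_direct, cov / sx**2)
```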

In the case of the population correlation coefficient:
Suppose $(X,Y)$ is a pair of random variables (discrete or continuous; for simplicity we take them discrete) with joint pmf $p(x,y)$ and marginal pmfs $p(x)$, $p(y)$. Then

$$ \rho = \dfrac{\sum_x \sum_y (x - \mu_X)(y - \mu_Y)p(x,y)}{\sqrt{\sum_x (x - \mu_X)^2p(x) \sum_y (y - \mu_Y)^2p(y)} } \tag{6} $$

where $\mathrm{Cov}(X,Y)$ is the population covariance (no bias correction is needed here, since this is the population itself), and $(\sigma_X, \sigma_Y)$ are the population standard deviations of $X$ and $Y$ respectively. For a given population, with its joint and marginal pmfs,

$$ \mathrm{Cov}(X,Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y)p(x,y) \tag{7} $$

$$ \sigma_X^2 = \sum_x (x - \mu_X)^2p(x), \qquad \sigma_Y^2 = \sum_y (y - \mu_Y)^2p(y) \tag{8} $$

Substituting equations (7) and (8) into (6), we get the simplified form

$$ \rho = \dfrac{\sum_x \sum_y (x - \mu_X)(y - \mu_Y)p(x,y)}{\sqrt{\sum_x (x - \mu_X)^2p(x) \sum_y (y - \mu_Y)^2p(y)} } = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y} \tag{9} $$

Similarly, for the population regression line slope,

$$ \beta_1 = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X^2} $$
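A small worked sketch of equations (6)–(9) for a made-up discrete joint pmf, with the marginals obtained by summing the joint pmf over the other variable:

```python
import numpy as np

# Made-up joint pmf for discrete (X, Y); marginals follow by summing.
xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([0.0, 2.0])
p = np.array([[0.15, 0.10],
              [0.20, 0.15],
              [0.10, 0.30]])   # p[i, j] = P(X = xs[i], Y = ys[j])
assert np.isclose(p.sum(), 1.0)

px, py = p.sum(axis=1), p.sum(axis=0)   # marginal pmfs p(x), p(y)
mu_x, mu_y = xs @ px, ys @ py           # population means

cov = sum(p[i, j] * (xs[i] - mu_x) * (ys[j] - mu_y)
          for i in range(len(xs)) for j in range(len(ys)))  # equation (7)
var_x = ((xs - mu_x) ** 2) @ px                             # equation (8)
var_y = ((ys - mu_y) ** 2) @ py

rho = cov / np.sqrt(var_x * var_y)      # equation (9)
beta = cov / var_x                      # population regression slope
assert -1.0 <= rho <= 1.0               # Cauchy-Schwarz bound on rho
```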

Pending gaps:
If the approach above is correct, then I have a further question: how can equations (6) and (7) be proved directly, individually, rather than just by saying they are analogous to the sample case?