Is $X$ (independent variable) considered random in linear regression? Does correlation make sense as an unbiased estimator?


Basically my question comes from two parts. In simple definitions of linear regression, we consider $X$ to be a known constant. Thus $$Var(Y) = Var(b_0 + b_1 x + e) = Var(e).$$ Great.
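A quick simulation illustrates this fixed-$X$ view: if $x$ is held constant, all variability in $Y$ comes from the error term. (The parameter values below are purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1, sigma_e = 2.0, 3.0, 1.5  # illustrative parameter values
x = 4.0                          # X treated as a known constant

# Y = b0 + b1*x + e; with x fixed, Var(Y) = Var(e) = sigma_e**2
e = rng.normal(0.0, sigma_e, size=1_000_000)
y = b0 + b1 * x + e

print(np.var(y))  # close to sigma_e**2 = 2.25
```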

However, a great deal of linear regression has to do with $Cov(X,Y)$. This quantity is used to estimate our parameters, and it implicitly treats $X$ as a random variable.

How do we distinguish between these two schools of thought?

I'm ultimately trying to show that the sample correlation between $X$ and $Y$ in simple linear regression is an unbiased estimator of the true correlation between $X$ and $Y$. Does this question even make sense, given that the sample correlation $$Corr(X,\hat Y) = Corr(X, b_0 + b_1 X) = Corr(X, b_1 X) = \pm 1$$ always (the sign being that of $b_1$)? But even if this doesn't make sense, aren't we still able to make claims in linear regression about the correlation between $X$ and $Y$, because $$E[b_1] = \beta_1, \qquad b_1 = r\frac{s_y}{s_x}?$$
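A minimal numpy sketch (simulated data, illustrative parameters) confirms the point: the fitted values $\hat Y = b_0 + b_1 X$ are an affine function of $X$, so their sample correlation with $X$ is exactly $\pm 1$, while $Corr(X, Y)$ itself is not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=1000)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=1000)

# OLS fit: b1 = Cov(x, y) / Var(x), b0 = ybar - b1 * xbar
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Fitted values are an affine function of x, so |Corr(x, y_hat)| = 1
print(np.corrcoef(x, y_hat)[0, 1])  # 1.0 up to floating-point error
print(np.corrcoef(x, y)[0, 1])      # strictly less than 1
```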

1 Answer

Covariance is defined as $$Cov(X,Y) = \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)$$

If $Y = a + bX + \varepsilon$, then:

$$Cov(X,Y) = \mathbb{E}((a + bX + \varepsilon)X) - \mathbb{E}(a + bX + \varepsilon)\mathbb{E}(X)$$
$$= a\mathbb{E}(X) + b\mathbb{E}(X^2) + \mathbb{E}(X \varepsilon) - a\mathbb{E}(X) - b\mathbb{E}(X)^2 - \mathbb{E}(\varepsilon)\mathbb{E}(X)$$

The $a\mathbb{E}(X)$ terms cancel, and since $\varepsilon$ is assumed independent of $X$ we have $\mathbb{E}(\varepsilon X) = \mathbb{E}(\varepsilon)\mathbb{E}(X)$, so those two terms cancel as well.

Therefore: $$Cov(X,Y) = b(\mathbb{E}(X^2) - \mathbb{E}(X)^2) = b \sigma_X^2$$

Since $X$ and $\varepsilon$ are independent, we find $\sigma_Y = \sqrt{b^2 \sigma_X^2 + \sigma_\varepsilon^2}$.

So the correlation coefficient is: $$Corr(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = \frac{b \sigma_X^2}{\sigma_X \sqrt{b^2 \sigma_X^2 + \sigma_\varepsilon^2}} = \frac{b \sigma_X}{\sqrt{b^2 \sigma_X^2 + \sigma_\varepsilon^2}}$$
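The derivation above can be checked numerically. This sketch (illustrative values for $a$, $b$, $\sigma_X$, $\sigma_\varepsilon$) compares the simulated covariance and correlation against the closed-form expressions $b\sigma_X^2$ and $b\sigma_X/\sqrt{b^2\sigma_X^2 + \sigma_\varepsilon^2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, sigma_x, sigma_e = 1.0, 2.0, 1.5, 0.8  # illustrative values

x = rng.normal(0.0, sigma_x, size=1_000_000)
eps = rng.normal(0.0, sigma_e, size=1_000_000)
y = a + b * x + eps

# Theoretical values from the derivation above
cov_theory = b * sigma_x**2
corr_theory = b * sigma_x / np.sqrt(b**2 * sigma_x**2 + sigma_e**2)

print(np.cov(x, y, ddof=1)[0, 1], cov_theory)  # both close to 4.5
print(np.corrcoef(x, y)[0, 1], corr_theory)    # both close to 0.966
```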

So the correlation depends on the ratio between the explained variance $b^2 \sigma_X^2$ and the unexplained variance $\sigma_\varepsilon^2$.

I'm not sure whether linear regression always produces $a$ and $b$ such that $Corr(X, Y_{pred}) = Corr(X, Y_{real})$, though.