Regression residuals = x, what does that mean?


Imagine you run a linear regression and you observe that the residuals $r_x = x$. What does this say about our procedure / LR assumptions? More importantly, what does this say about the correlation between x and y?

I saw this question on another forum without an answer to it. I'm studying linear regression and would like some clarity on how to think about this kind of question.

My thoughts are that:

a) there is some regularization occurring such that our coefficient isn't accounting for the residuals;

b) there is a confounding variable.

What does this say about the correlation between x and y then?

Some background: the standard model of ordinary linear regression assumes a linear relationship between two random variables $X$ and $Y$: $$ Y = \beta_0 + \beta_1 X + U $$ where the random variable $U$ denotes the error term, i.e. all unobserved factors influencing $Y$.

Given a sample $(y_i, x_i)$, $i = 1, \dots, n$, we can find estimates $\widehat{\beta_0}, \widehat{\beta_1}$ of the model parameters, for example by ordinary least squares (OLS). The fitted values are then $$ \widehat{y}_i = \widehat{\beta_0} + \widehat{\beta_1} x_i $$ and the residuals, the sample estimates of $U$, are $$ \widehat{u}_i = y_i - \widehat{y}_i $$ Furthermore, the estimate $\widehat{\beta_1}$ can be expressed as $$ \widehat{\beta_1} = \widehat{\rho}_{x,y} \frac{\widehat{\sigma}_{y}}{\widehat{\sigma}_{x}} $$ where $\widehat{\rho}_{x,y}$ is the sample estimate of the correlation between $X$ and $Y$, $$ \rho_{x,y} = \frac{\text{Cov}(X, Y)}{\left[ \text{Var}(X) \text{Var}(Y) \right]^{1/2} } $$ and $\widehat{\sigma}_{x}, \widehat{\sigma}_{y}$ are estimates of the standard deviations of $X$ and $Y$ respectively.
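A minimal sketch of these estimates in NumPy (the data-generating numbers are made up for illustration): the slope is computed from the correlation formula above and cross-checked against a library least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)  # illustrative model

# OLS slope via the correlation formula: beta1_hat = rho_hat * sigma_y / sigma_x
rho_hat = np.corrcoef(x, y)[0, 1]
beta1_hat = rho_hat * y.std() / x.std()
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual estimates: u_hat_i = y_i - y_hat_i
u_hat = y - (beta0_hat + beta1_hat * x)

# Cross-check against NumPy's least-squares line fit
b1, b0 = np.polyfit(x, y, deg=1)
print(beta1_hat, b1)  # the two slope estimates agree
```

Note that the residuals produced this way are mechanically uncorrelated with $x$ in-sample; that is a property of the OLS fit itself, not of the model assumptions.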

Another assumption of ordinary linear regression is that the mean of the error term $U$ is zero, conditional on any value of $X$: \begin{equation} \mathbb{E}[U \mid X] = 0 \tag{1} \end{equation}

This also implies a slightly weaker condition that $\text{Cov}(U, X) = 0$.
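The implication can be seen from the law of iterated expectations: $$ \text{Cov}(U, X) = \mathbb{E}[UX] - \mathbb{E}[U]\,\mathbb{E}[X] = \mathbb{E}\big[X \, \mathbb{E}[U \mid X]\big] - \mathbb{E}\big[\mathbb{E}[U \mid X]\big]\,\mathbb{E}[X] = 0 $$ since $\mathbb{E}[U \mid X] = 0$ by Eq. (1).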

Now to answer your questions. The observation $\widehat{u}_i = x_{i}$ violates the assumption of regression analysis stated in Eq. (1).

This could mean that the unobserved factors in $U$ are correlated with $X$: $$ \text{Cov}(U, X) \neq 0 $$ You can think about what is likely to be in $U$. If your thought (b) is true, then there exists some random variable $W$ that influences $Y$, and is thus part of $U$, such that $$ \mathbb{E}[W \mid X = x_0] \neq \mathbb{E}[W \mid X = x_1] $$ for some values $x_0$ and $x_1$ of $X$. Therefore $U$ and $X$ are correlated.
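A hedged simulation of scenario (b): a confounder $W$ drives both $X$ and $Y$, so omitting it from the regression biases the slope. All coefficients below are invented for illustration; the true effect of $X$ on $Y$ is 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
w = rng.normal(size=n)             # unobserved confounder W
x = w + rng.normal(size=n)         # X depends on W, so E[W | X] varies with X
y = 1.0 + 2.0 * x + 3.0 * w + rng.normal(size=n)  # W lives in the error term U

# Simple regression of y on x omits w; the slope absorbs part of w's effect:
# plim(slope) = beta1 + 3 * Cov(W, X) / Var(X) = 2 + 3 * 1 / 2 = 3.5
slope = np.polyfit(x, y, 1)[0]
print(slope)  # noticeably above the true value 2
```

The bias formula in the comment is the standard omitted-variable bias expression, evaluated for this particular simulated design.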

All in all, when the assumptions of regression analysis are violated, the estimate $\widehat{\beta_1}$, and consequently the estimate of the correlation between $X$ and $Y$, can no longer be interpreted in the usual way.

For regression analysis to make sense, we have to collect the observations of all random variables influencing $Y$ into a design matrix $X$ and perform multiple linear regression $$ Y = X \beta + U $$ then check that $U \perp X$ and also \begin{align*} &\mathbb{E}[U] = 0 \\ &\text{Var}(U) = \sigma^{2} I \end{align*} are satisfied. These are the so-called Gauss-Markov conditions. If they are not, the model can often be transformed so that they hold, for example by using weighted least squares or generalized least squares.
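Continuing the simulated confounder example from above (same made-up coefficients), a sketch of the multiple-regression fix: once $W$ is included as a column of the design matrix, least squares recovers the true coefficients, and the residuals are orthogonal to every regressor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
w = rng.normal(size=n)
x = w + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * w + rng.normal(size=n)  # true beta = (1, 2, 3)

# Design matrix with intercept column, x, and the (now observed) confounder w
X = np.column_stack([np.ones(n), x, w])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to the true (1, 2, 3)

# Residuals are orthogonal to every column of X, as the conditions require
u_hat = y - X @ beta_hat
print(X.T @ u_hat)  # each entry is numerically ~ 0
```

This only works, of course, when the confounder is actually measured; the point of the answer above is that the pattern $\widehat{u}_i = x_i$ is a symptom telling you something influencing $Y$ has been left out.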