Linear regression: is near multicollinearity really a problem?


I have the following model for continuous variables: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon$$

Everything works out very well: the model passes all the usual tests and satisfies every assumption except one. $X_1$ and $X_2$ are highly correlated (>0.9), so we may have multicollinearity; both have huge VIFs (around 20). When I remove $X_2$ from the model, the VIF of each remaining variable is under 4, but the results I obtain are similar yet slightly worse, on every criterion.

So if I follow the theory, I should remove $X_2$. But practical tests show that the first model is slightly better (just a tiny bit). What should I do? Why do the models behave this way, in opposition to what the theory says?

Accepted answer:
  1. There is no assumption that the $X$s are uncorrelated, so your model does not violate any theoretical consideration.

  2. However, high correlation (and certainly a VIF as large as $20$) is an important practical consideration for the model's stability.

  3. If you are interested in predictive power and not at all worried about inference, then it should not bother you much.

  4. However, if you want to deal with the multicollinearity in a slightly more sophisticated way than just dropping $X_2$, you can:

(i) Perform principal components regression on the subset of the $k$ best PCs: https://en.wikipedia.org/wiki/Principal_component_regression

(ii) Perform ridge regression: https://en.wikipedia.org/wiki/Tikhonov_regularization
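Both alternatives can be sketched with `scikit-learn` on synthetic collinear data (the dataset, the number of components `k=3`, and the ridge penalty `alpha=1.0` are illustrative choices; in practice they would be tuned by cross-validation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)           # near-collinear with x1
x3, x4, x5 = rng.normal(size=(3, n))
X = np.column_stack([x1, x2, x3, x4, x5])
y = 1.0 + 0.5 * x1 + 0.5 * x2 + 0.3 * x3 + rng.normal(size=n)

# (i) Principal components regression: regress y on the k best PCs
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)

# (ii) Ridge regression: the L2 penalty shrinks coefficients and
# stabilises them in the presence of near-collinearity
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)

print("PCR   R^2:", round(pcr.score(X, y), 3))
print("Ridge R^2:", round(ridge.score(X, y), 3))
```

Unlike dropping $X_2$ outright, both approaches keep the information in all five regressors while taming the unstable coefficient estimates that the high VIF warns about.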