In Linear Modelling, Why Would I Keep $X_2$ Alone but Drop $X_2$ and $X_3$ Together?


I'm having trouble with some intuition in Linear Statistical Modelling.

I'm working with some data with three predictor variables $X_1$, $X_2$ and $X_3$.

I've calculated the $F$ test for whether $X_2$ should be retained in the model containing just $X_1$ and $X_2$.

Similarly I've calculated the $F$ test for whether $X_2$ and $X_3$ should be retained in the model containing $X_1$, $X_2$ and $X_3$.

The conclusion from the first test is to retain $X_2$. The $p$-value is less than $0.05$.

The conclusion from the second test is to drop $X_2$ and $X_3$. The $p$-value is greater than $0.1$.

So my question is, is this not counter-intuitive? How could adding another variable $X_3$ make things worse for $X_2$ than it was with just $X_2$ alone?

My feeling is that I have made a calculation error, because it seems the $p$-value should only get smaller with more information, not larger.

Any insight into this matter would be greatly appreciated, thank you!

I can't show you the exact data without getting fired, but here's a baby data set that illustrates the same issue. The $F$ test for whether you can drop $X_2$ from the model with $X_1$ and $X_2$ has $p$-value $0.06$, while the $F$ test for whether you can drop $X_2$ and $X_3$ from the model with $X_1$, $X_2$ and $X_3$ has $p$-value $0.14$. So how can it get larger like that? This says I need $X_2$ on top of $X_1$, yet I can drop both $X_2$ and $X_3$ on top of $X_1$. To me that's a contradiction. Am I doing something wrong here?
$$
\begin{array}{cccc}
Y & X_1 & X_2 & X_3 \\
48.0 & 50.0 & 51.0 & 2.3 \\
57.0 & 36.0 & 46.0 & 2.3 \\
66.0 & 40.0 & 48.0 & 2.2 \\
70.0 & 41.0 & 44.0 & 1.8 \\
89.0 & 28.0 & 43.0 & 1.8 \\
36.0 & 49.0 & 54.0 & 2.9 \\
46.0 & 42.0 & 50.0 & 2.2 \\
54.0 & 45.0 & 48.0 & 2.4 \\
26.0 & 52.0 & 62.0 & 2.9 \\
77.0 & 29.0 & 50.0 & 2.1 \\
89.0 & 29.0 & 48.0 & 2.4 \\
67.0 & 43.0 & 53.0 & 2.4 \\
47.0 & 38.0 & 55.0 & 2.2 \\
51.0 & 34.0 & 51.0 & 2.3 \\
57.0 & 53.0 & 54.0 & 2.2 \\
66.0 & 36.0 & 49.0 & 2.0 \\
79.0 & 33.0 & 56.0 & 2.5 \\
88.0 & 29.0 & 46.0 & 1.9 \\
60.0 & 33.0 & 49.0 & 2.1 \\
49.0 & 55.0 & 51.0 & 2.4 \\
77.0 & 29.0 & 52.0 & 2.3 \\
52.0 & 44.0 & 58.0 & 2.9 \\
60.0 & 43.0 & 50.0 & 2.3 \\
\end{array}
$$
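For reference, here is a sketch of how the two partial $F$ tests can be run in R on the baby data set above, using `anova()` to compare nested fits (the data frame and variable names are my own choices, not from the original post):

```r
# Baby data set from the question
d <- data.frame(
  Y  = c(48, 57, 66, 70, 89, 36, 46, 54, 26, 77, 89, 67,
         47, 51, 57, 66, 79, 88, 60, 49, 77, 52, 60),
  X1 = c(50, 36, 40, 41, 28, 49, 42, 45, 52, 29, 29, 43,
         38, 34, 53, 36, 33, 29, 33, 55, 29, 44, 43),
  X2 = c(51, 46, 48, 44, 43, 54, 50, 48, 62, 50, 48, 53,
         55, 51, 54, 49, 56, 46, 49, 51, 52, 58, 50),
  X3 = c(2.3, 2.3, 2.2, 1.8, 1.8, 2.9, 2.2, 2.4, 2.9, 2.1, 2.4, 2.4,
         2.2, 2.3, 2.2, 2.0, 2.5, 1.9, 2.1, 2.4, 2.3, 2.9, 2.3)
)

m0 <- lm(Y ~ X1, data = d)            # reduced model: X1 only
m1 <- lm(Y ~ X1 + X2, data = d)       # adds X2
m2 <- lm(Y ~ X1 + X2 + X3, data = d)  # adds X2 and X3

p1 <- anova(m0, m1)[2, "Pr(>F)"]  # test for dropping X2
p2 <- anova(m0, m2)[2, "Pr(>F)"]  # test for dropping X2 and X3 jointly
```

The question reports $p_1 \approx 0.06$ and $p_2 \approx 0.14$ for these two comparisons.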

Best Answer

One possible reason for this is high linear correlation between $X_2$ and $X_3$. Here is an example:

set.seed(111)
x_1 = rnorm(100, 10, 3)
x_2 = rnorm(100, 10, 3)

# y depends on x_1 and x_2 only, with large noise
y = -x_1 + x_2 + rnorm(100, 0, 17)

# x_3 is x_2 plus a little noise, so it is nearly collinear with x_2
x_3 = x_2 + rnorm(100, 0, 1)

m0 = lm(y ~ x_1)              # reduced model R
m1 = lm(y ~ x_1 + x_2)        # full model F_1
m2 = lm(y ~ x_1 + x_2 + x_3)  # full model F_2

anova(m0, m1)  # partial F test for dropping x_2
anova(m0, m2)  # partial F test for dropping x_2 and x_3 jointly
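As a quick sanity check on the collinearity, the construction `x_3 = x_2 + rnorm(100, 0, 1)` gives a theoretical correlation of $3/\sqrt{3^2 + 1^2} = 3/\sqrt{10} \approx 0.95$ between $x_2$ and $x_3$. A minimal sketch (the seed here is arbitrary and differs from the example above, so exact numbers will differ):

```r
set.seed(222)                  # any seed; the qualitative picture is the same
x_2 <- rnorm(100, 10, 3)       # sd 3
x_3 <- x_2 + rnorm(100, 0, 1)  # add sd-1 noise
cor(x_2, x_3)                  # close to 3/sqrt(10), about 0.95
```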

Another possible reason is that $X_3$ is independent of $X_1$, $X_2$ and $Y$.

Recall the form of the partial $F$ test:
$$
F_{stat} = \frac{(SSRes(R) - SSRes(F))/r}{MSE(F)},
$$
where $r$ is the number of predictors being dropped. Denote the model with $X_1$ and $X_2$ by $F_1$, the model with $X_1$, $X_2$ and $X_3$ by $F_2$, and the model with only $X_1$ by $R$. If $X_2$ is highly collinear with $X_3$ (or $X_3$ is independent of $X_1$, $X_2$ and $Y$), then although
$$
SSRes(R) - SSRes(F_2) > SSRes(R) - SSRes(F_1),
$$
it may happen that
$$
\left( SSRes(R) - SSRes(F_2) \right)/2 < \left( SSRes(R) - SSRes(F_1) \right)/1,
$$
since the test against $F_2$ drops $r = 2$ predictors while the test against $F_1$ drops only $r = 1$, and meanwhile $MSE(F_2) > MSE(F_1)$. Hence the partial $F$ statistic for the first test may be significant while for the second it is insignificant. In other words, the $F$ statistic is not a monotonically increasing function of the number of variables: a highly collinear or independent variable that adds nothing but noise may reduce the calculated $F$ statistic and hence elevate the $p$-value.
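To make the bookkeeping concrete, the formula above can be computed by hand and checked against `anova()`. A sketch (the helper `partial_F` is my own name, not a standard R function; the simulated data follows the example earlier in this answer):

```r
# Hand computation of the partial F statistic from the formula above
partial_F <- function(reduced, full) {
  ssres_r <- sum(resid(reduced)^2)                     # SSRes(R)
  ssres_f <- sum(resid(full)^2)                        # SSRes(F)
  r       <- df.residual(reduced) - df.residual(full)  # predictors dropped
  mse_f   <- ssres_f / df.residual(full)               # MSE(F)
  ((ssres_r - ssres_f) / r) / mse_f
}

set.seed(111)
x_1 <- rnorm(100, 10, 3)
x_2 <- rnorm(100, 10, 3)
y   <- -x_1 + x_2 + rnorm(100, 0, 17)
x_3 <- x_2 + rnorm(100, 0, 1)

m0 <- lm(y ~ x_1)
m1 <- lm(y ~ x_1 + x_2)
m2 <- lm(y ~ x_1 + x_2 + x_3)

partial_F(m0, m1)  # r = 1: drop x_2
partial_F(m0, m2)  # r = 2: drop x_2 and x_3 jointly
```

Both values agree with the $F$ column that `anova()` reports for the same nested comparisons; the second statistic pays the price of the larger divisor $r = 2$ for an almost unchanged numerator.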