P-value changes based on removed variable?


Problem Setting:
Assume we have $p$ variables $x_1, x_2, ..., x_p$ and a response vector $y$ of length $n$, and that each p-value comes from the F-test of the null hypothesis $H_0: \beta_j = 0$. I am wondering whether we can simulate the following cases:
(1). the $j$th variable has a p-value < 0.05 in the linear model of $y$ on all $p$ variables, but its p-value rises above 0.05 if we remove the $k$th variable from the model;
(2). the $j$th variable has a p-value > 0.05 in the linear model of $y$ on all $p$ variables, but its p-value falls below 0.05 if we remove the $k$th variable from the model.
I think the second case is a multicollinearity problem, so for (2) we can simulate a multicollinear dataset and expect that removing one of the collinear columns produces the desired behavior.
However, I cannot figure out what kind of problem the first case corresponds to.

Answer:

$$
\begin{array}{|c|c|r|}
\hline
x_1 & x_2 & y \\
\hline
1 & 1.112650 & 0.047847540 \\
2 & 1.773290 & -0.274229966 \\
3 & 3.496441 & 0.468329473 \\
4 & 4.048456 & 0.083754819 \\
5 & 5.337986 & 0.467638735 \\
6 & 5.871713 & -0.273584947 \\
7 & 6.490826 & -0.511360778 \\
8 & 7.410821 & -0.746130949 \\
9 & 8.876228 & 0.009768297 \\
10 & 9.818910 & -0.258969079 \\
\hline
\end{array}
$$
What is happening here can be seen via some ANOVA output:

> anova(lm(y~x2))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value Pr(>F)
x2         1 0.16718 0.16718  1.0979 0.3253
Residuals  8 1.21815 0.15227

> anova(lm(y~x1))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value Pr(>F)
x1         1 0.26618 0.26618  1.9027 0.2051
Residuals  8 1.11915 0.13989

> anova(lm(y~x1+x2))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x1         1 0.26618 0.26618  27.621  0.001179 ** 
x2         1 1.05170 1.05170 109.134 1.603e-05 ***
Residuals  7 0.06746 0.00964

> anova(lm(y~x2+x1))
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x2         1 0.16718 0.16718  17.348  0.004215 ** 
x1         1 1.15069 1.15069 119.407 1.189e-05 ***
Residuals  7 0.06746 0.00964

Here, if $y$ is regressed on $x_1$ alone or on $x_2$ alone, the predictor is not significant at all, but if it is regressed on both, then both are significant, and the second one extremely so, regardless of which one is second. Do it in either order and the one that comes last is the one with the far smaller p-value. (Recall that `anova()` in R reports sequential, Type I sums of squares: each term is tested after the terms entered before it, so the term entered last is the one credited with the huge reduction in residual sum of squares.)

The way I constructed this: I added a small error to $x_1$ to get $x_2,$ and then I created $y$ by adding an even smaller error to the difference between $x_1$ and $x_2.$ A good statistician looking at this data set will identify the difference $x_1-x_2$ as the place where the information about $y$ is located.
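That construction can be sketched in R as follows (the seed and error sizes here are illustrative guesses, not the values used to generate the table above):

```r
set.seed(1)                            # illustrative seed, not the original one
x1 <- 1:10
x2 <- x1 + rnorm(10, sd = 0.3)         # x2: a small error added to x1
y  <- (x2 - x1) + rnorm(10, sd = 0.1)  # y: an even smaller error added to x2 - x1

anova(lm(y ~ x1))       # x1 alone: typically far from significant
anova(lm(y ~ x2))       # x2 alone: typically far from significant
anova(lm(y ~ x1 + x2))  # together: both terms typically highly significant
```

Individually, each predictor is almost pure trend, which carries no information about $y$; only the contrast $x_2 - x_1$ does, and the model can exploit that contrast only when both columns are present.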

This answers your first question. And this is clearly multicollinearity: $\operatorname{cor}(x_1,x_2) = 0.9948076.$

Next data set, addressing the second question:
$$
\begin{array}{|c|c|r|}
\hline
x_1 & x_2 & y \\
\hline
1 & 0.8895828 & 0.8627504 \\
2 & 1.9205511 & 1.6195161 \\
3 & 3.4886563 & 3.2512496 \\
4 & 3.8981606 & 3.3274232 \\
5 & 5.3081763 & 4.7576811 \\
6 & 5.8849265 & 6.2406326 \\
7 & 6.4093155 & 7.5652209 \\
8 & 7.9241227 & 7.6195349 \\
9 & 9.4701478 & 8.4085932 \\
10 & 10.3827500 & 10.8381033 \\
\hline
\end{array}
$$

> anova(lm(y~x1+x2))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq  F value    Pr(>F)    
x1         1 91.050  91.050 312.3273 4.578e-07 ***
x2         1  0.035   0.035   0.1193    0.7399    
Residuals  7  2.041   0.292                       

> anova(lm(y~x2+x1))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x2         1 89.594  89.594 307.334 4.838e-07 ***
x1         1  1.491   1.491   5.113   0.05823 .  
Residuals  7  2.041   0.292 

> anova(lm(y~x1))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x1         1 91.050  91.050  350.96 6.807e-08 ***
Residuals  8  2.075   0.259                      

> anova(lm(y~x2))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x2         1 89.594  89.594  202.98 5.741e-07 ***
Residuals  8  3.531   0.441

Here, whichever of $x_1,x_2$ comes first is extremely significant, and the one in the second position, whichever it is, falls short of significance at the $0.05$ level; yet when either is the sole predictor, it is highly significant.
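This order dependence is, again, a consequence of sequential (Type I) sums of squares. If you want tests that do not depend on entry order, `drop1()` (or the t-tests in `summary()`) tests each term as if it were entered last. A quick sketch, using hypothetical data built the same way as above (seed and error sizes are my guesses):

```r
set.seed(2)                      # illustrative seed
x1 <- 1:10
x2 <- x1 + rnorm(10, sd = 0.3)   # x2 nearly collinear with x1
y  <- x1 + rnorm(10, sd = 0.5)

fit <- lm(y ~ x1 + x2)
anova(lm(y ~ x1 + x2))  # sequential SS: the first-entered term looks strong
anova(lm(y ~ x2 + x1))  # swap the order: now the other term looks strong
drop1(fit, test = "F")  # each term tested as if entered last: order-independent
```

The `drop1()` p-values agree with the coefficient t-tests in `summary(fit)` and are the same no matter how the formula is ordered.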

And this is also multicollinearity: $\operatorname{cor}(x_1,x_2)= 0.9940944.$

Here I created $x_1$ and $x_2$ in the same way as before, but this time I created $y$ just by adding a small error to $x_1$ alone.
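In R, that second construction might look like this (again with a guessed seed and guessed error sizes, not the originals):

```r
set.seed(3)                      # illustrative seed
x1 <- 1:10
x2 <- x1 + rnorm(10, sd = 0.3)   # x2: a small error added to x1, as before
y  <- x1 + rnorm(10, sd = 0.5)   # y: a small error added to x1 alone

anova(lm(y ~ x1))       # x1 alone: typically highly significant
anova(lm(y ~ x2))       # x2 alone: typically highly significant too
anova(lm(y ~ x1 + x2))  # whichever term comes second typically adds little
```

Because $x_2$ is nearly a copy of $x_1$, either variable alone explains $y$ well, but once one of them is in the model the other has almost nothing left to explain.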