P-value changes based on removed variable?

Problem Setting:
Assume we have $p$ variables $x_1, x_2, \ldots, x_p$ and a response vector $y$ of length $n$, where each p-value comes from the F-test of the null hypothesis $H_0: \beta_j = 0$. I am wondering whether we can simulate these two cases:
(1) the $j$th variable has a p-value < 0.05 in the linear model regressing $y$ on all $p$ variables, but its p-value rises above 0.05 if we remove the $k$th variable from the model;
(2) the $j$th variable has a p-value > 0.05 in the linear model regressing $y$ on all $p$ variables, but its p-value drops below 0.05 if we remove the $k$th variable from the model.
I think the second case is a multicollinearity problem, so for (2) we can simulate a multicollinear dataset and expect that removing one of the collinear columns produces the desired behavior.
However, I cannot figure out what kind of problem the first case corresponds to.
$$
\begin{array}{|c|c|c|}
\hline
x_1 & x_2 & y \\
\hline
1 & 1.112650 & \phantom{-}0.047847540 \\
2 & 1.773290 & -0.274229966 \\
3 & 3.496441 & \phantom{-}0.468329473 \\
4 & 4.048456 & \phantom{-}0.083754819 \\
5 & 5.337986 & \phantom{-}0.467638735 \\
6 & 5.871713 & -0.273584947 \\
7 & 6.490826 & -0.511360778 \\
8 & 7.410821 & -0.746130949 \\
9 & 8.876228 & \phantom{-}0.009768297 \\
10 & 9.818910 & -0.258969079 \\
\hline
\end{array}
$$
What is happening here can be seen from sequential ANOVA output; a sketch to produce it follows.
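A minimal sketch, assuming Python with pandas and statsmodels rather than the original session (presumably R): `typ=1` requests sequential (Type I) sums of squares, where each predictor's F-statistic uses its sequential sum of squares against the full model's residual mean square, so the order in which predictors enter matters.

```python
# Sketch: fit the first dataset both ways and print sequential ANOVA tables.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "x1": list(range(1, 11)),
    "x2": [1.112650, 1.773290, 3.496441, 4.048456, 5.337986,
           5.871713, 6.490826, 7.410821, 8.876228, 9.818910],
    "y":  [0.047847540, -0.274229966, 0.468329473, 0.083754819, 0.467638735,
           -0.273584947, -0.511360778, -0.746130949, 0.009768297, -0.258969079],
})

for formula in ("y ~ x1", "y ~ x2", "y ~ x1 + x2", "y ~ x2 + x1"):
    fit = smf.ols(formula, data=df).fit()
    print(formula)
    print(sm.stats.anova_lm(fit, typ=1))  # Type I: order-dependent p-values
```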
Here, if $y$ is regressed on $x_1$ or on $x_2,$ the predictor is not significant at all, but if it is regressed on both, then both are significant and the second one extremely so, regardless of which one is second. Do it in either order and the one that comes last is the one with the far smaller p-value.
The way I did this is I added a small error to $x_1$ to get $x_2,$ and then I created $y$ by adding a smaller error to the difference between $x_1$ and $x_2.$ A good statistician with this data set will identify the difference $x_1-x_2$ as the place where the information about $y$ is located.
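As a hedged illustration of that construction, the generation might look like the following; it regenerates a dataset with the same structure, not the exact numbers above, and the noise scales (0.3 and 0.1) are my guesses, not the answer's actual values.

```python
# Hypothetical re-creation of the construction described above.
import numpy as np

rng = np.random.default_rng(0)
x1 = np.arange(1, 11, dtype=float)
x2 = x1 + rng.normal(scale=0.3, size=10)        # x2 = x1 + small error
y = (x1 - x2) + rng.normal(scale=0.1, size=10)  # y = (x1 - x2) + smaller error
print(np.corrcoef(x1, x2)[0, 1])                # close to 1: strong collinearity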
This answers your first question. And this is clearly multicollinearity: $\operatorname{cor}(x_1,x_2) = 0.9948076.$
Next data set, addressing the second question:
$$
\begin{array}{|c|c|c|}
\hline
x_1 & x_2 & y \\
\hline
1 & 0.8895828 & 0.8627504 \\
2 & 1.9205511 & 1.6195161 \\
3 & 3.4886563 & 3.2512496 \\
4 & 3.8981606 & 3.3274232 \\
5 & 5.3081763 & 4.7576811 \\
6 & 5.8849265 & 6.2406326 \\
7 & 6.4093155 & 7.5652209 \\
8 & 7.9241227 & 7.6195349 \\
9 & 9.4701478 & 8.4085932 \\
10 & 10.3827500 & 10.8381033 \\
\hline
\end{array}
$$
Here, whichever of $x_1, x_2$ comes first is extremely significant, while the one that comes second, regardless of which it is, falls short of significance at the $0.05$ level; yet when either is the sole predictor, it is highly significant.
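The same sequential-ANOVA sketch as before, under the same statsmodels assumptions, applied to this dataset:

```python
# Sketch: sequential ANOVA for the second dataset, in both predictor orders.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.DataFrame({
    "x1": list(range(1, 11)),
    "x2": [0.8895828, 1.9205511, 3.4886563, 3.8981606, 5.3081763,
           5.8849265, 6.4093155, 7.9241227, 9.4701478, 10.3827500],
    "y":  [0.8627504, 1.6195161, 3.2512496, 3.3274232, 4.7576811,
           6.2406326, 7.5652209, 7.6195349, 8.4085932, 10.8381033],
})

for formula in ("y ~ x1", "y ~ x2", "y ~ x1 + x2", "y ~ x2 + x1"):
    fit = smf.ols(formula, data=df2).fit()
    print(formula)
    print(sm.stats.anova_lm(fit, typ=1))  # last-entered predictor falls short
```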
And this is also multicollinearity: $\operatorname{cor}(x_1,x_2)= 0.9940944.$
Here I created $x_1$ and $x_2$ in the same way, but this time I created $y$ just by adding a small error to $x_1.$
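Again as a hedged sketch of that construction, with guessed noise scales rather than the answer's actual values:

```python
# Hypothetical re-creation of the second construction.
import numpy as np

rng = np.random.default_rng(1)
x1 = np.arange(1, 11, dtype=float)
x2 = x1 + rng.normal(scale=0.3, size=10)  # same construction as before
y = x1 + rng.normal(scale=0.5, size=10)   # y = x1 + small error
print(np.corrcoef(x1, x2)[0, 1])          # again close to 1
```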