This is a minimal reproducible example. My real simulation work for research has error variance, but I want to see what is happening in this case, so I built it WITHOUT error variance to see the behaviour clearly.
The R code is below:
x1<-rnorm(999,0,1)
x2<-rnorm(999,0,1)
y <- x1+x2
iv1<-999999*x1
iv2<-999999*x2
cov(x1,x2) # nearly 0
cor(x1,x2) # nearly 0
cov(iv1,iv2) # very big,
cor(iv1,iv2) # nearly 0
summary(lm(y~x1+x2+x1*x2)) # interaction p=0.12
summary(lm(y~iv1+iv2+iv1:iv2)) #interaction significant.
Call:
lm(formula = y ~ x1 + x2 + x1 * x2)
Residuals:
       Min         1Q     Median         3Q        Max
-3.959e-14 -1.150e-16  1.400e-17  9.100e-17  4.756e-14
Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -7.728e-17  6.269e-17 -1.233e+00    0.218
x1           1.000e+00  6.090e-17  1.642e+16   <2e-16 ***
x2           1.000e+00  6.474e-17  1.545e+16   <2e-16 ***
x1:x2       -1.027e-16  6.630e-17 -1.550e+00    0.122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.976e-15 on 995 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.631e+32 on 3 and 995 DF, p-value: < 2.2e-16
> summary(lm(y~iv1+iv2+iv1:iv2))
Call:
lm(formula = y ~ iv1 + iv2 + iv1:iv2)
Residuals:
       Min         1Q     Median         3Q        Max
-4.137e-14 -1.010e-16  2.100e-17  1.050e-16  3.623e-14
Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -7.728e-17  5.581e-17 -1.385e+00   0.1665
iv1          1.000e-06  5.422e-23  1.844e+16   <2e-16 ***
iv2          1.000e-06  5.764e-23  1.735e+16   <2e-16 ***
iv1:iv2     -1.217e-28  5.902e-29 -2.062e+00   0.0395 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.759e-15 on 995 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.058e+32 on 3 and 995 DF, p-value: < 2.2e-16
When I constructed y, I did not include any interaction effect. So why does the x1:x2 interaction come out at p=0.12, and why is the iv1:iv2 interaction significant?
Why does the covariance matter?
duplicated from [https://stats.stackexchange.com/questions/579271/when-independent-variables-covary-not-correlate-in-the-regression]
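It is because of floating-point precision: R, like any software working in double precision, cannot resolve numbers this small relative to the rest of the calculation. The model should be
$y = \frac{1}{999999}\,iv1 + \frac{1}{999999}\,iv2$ for model 2. If you use iv1 = 99999*x1 and iv2 = 99999*x2 instead, you will get a significant interaction much less often. But because the numbers are so extreme, the software cannot handle them exactly and sometimes finds a false interaction. Note the scales: y is of order 1, while iv1 and iv2 are of order 10^6. R correctly recovers the coefficients for iv1 and iv2, 1/999999 ≈ 1e-6. For iv1:iv2, however, it sometimes returns a tiny nonzero coefficient, of order 1e-28 in your output, so that coefficient $\times$ iv1 $\times$ iv2 is very, very small, on the order of 1e-16, which is the size of double-precision rounding error. Because y was built with no error variance, the residuals are rounding error as well (note the residual standard error of about 2e-15), so the t-test ends up comparing one piece of numerical noise with another. That is why it sometimes finds an interaction when there really shouldn't be one.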
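To see the magnitudes involved, here is a rough sketch in R. The seed and the names `fit`, `eps`, `max_resid` and `contrib` are my own choices for illustration and are not part of the original code, so the exact numbers will differ from the output in the question.

```r
# Sketch: compare the fitted interaction term with double-precision rounding error.
# Assumption: the seed is arbitrary and only makes this run repeatable.
set.seed(1)
x1 <- rnorm(999, 0, 1)
x2 <- rnorm(999, 0, 1)
y  <- x1 + x2                # no error variance, no interaction
iv1 <- 999999 * x1
iv2 <- 999999 * x2

fit <- lm(y ~ iv1 + iv2 + iv1:iv2)

# Machine epsilon for doubles: the relative precision of a number near 1
eps <- .Machine$double.eps

# Largest residual: with a noise-free y this is pure rounding error
max_resid <- max(abs(residuals(fit)))

# Largest contribution the interaction term makes to any fitted value
contrib <- unname(abs(coef(fit)["iv1:iv2"])) * max(abs(iv1 * iv2))

print(c(eps = eps, max_resid = max_resid, contrib = contrib))
```

If all three printed values are many orders of magnitude below the scale of y, the interaction estimate and the residuals it is being tested against are both numerical noise, and the p-value for iv1:iv2 should not be read as evidence of a real effect.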