When independent variables covary (but do not correlate) in a regression, why does this happen?


This is a minimal reproducible example. My real simulation work for research includes error variance, but I want to see what is happening in this case, so I made this example WITHOUT error variance so that I can see it clearly.

I attach the R code:

x1 <- rnorm(999, 0, 1)
x2 <- rnorm(999, 0, 1)
y  <- x1 + x2          # no interaction term in the true model
iv1 <- 999999 * x1
iv2 <- 999999 * x2

cov(x1, x2)    # nearly 0
cor(x1, x2)    # nearly 0
cov(iv1, iv2)  # very big
cor(iv1, iv2)  # nearly 0

summary(lm(y ~ x1 + x2 + x1*x2))     # interaction p = 0.11

summary(lm(y ~ iv1 + iv2 + iv1:iv2)) # interaction significant


Call:
lm(formula = y ~ x1 + x2 + x1 * x2)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.959e-14 -1.150e-16  1.400e-17  9.100e-17  4.756e-14 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -7.728e-17  6.269e-17 -1.233e+00    0.218    
x1           1.000e+00  6.090e-17  1.642e+16   <2e-16 ***
x2           1.000e+00  6.474e-17  1.545e+16   <2e-16 ***
x1:x2       -1.027e-16  6.630e-17 -1.550e+00    0.122    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.976e-15 on 995 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.631e+32 on 3 and 995 DF,  p-value: < 2.2e-16


Call:
lm(formula = y ~ iv1 + iv2 + iv1:iv2)

Residuals:
       Min         1Q     Median         3Q        Max 
-4.137e-14 -1.010e-16  2.100e-17  1.050e-16  3.623e-14 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -7.728e-17  5.581e-17 -1.385e+00   0.1665    
iv1          1.000e-06  5.422e-23  1.844e+16   <2e-16 ***
iv2          1.000e-06  5.764e-23  1.735e+16   <2e-16 ***
iv1:iv2     -1.217e-28  5.902e-29 -2.062e+00   0.0395 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.759e-15 on 995 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.058e+32 on 3 and 995 DF,  p-value: < 2.2e-16


When I constructed y, I did not put in any interaction effect. So why does the x1:x2 interaction have p = 0.11, and why is the iv1:iv2 interaction significant?

Why does the covariance matter?
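One way to probe the question's setup (my own addition, using the same simulation as above): repeat the zero-error simulation many times and collect the x1:x2 interaction p-value from each run. If there is no real interaction, any "significant" result should just be chance or numerical noise; note that with zero error variance the residuals are pure rounding error, so the p-values need not even be uniformly distributed.

```r
# Repeat the question's simulation and record the interaction p-value each time.
set.seed(42)
pvals <- replicate(200, {
  x1 <- rnorm(999, 0, 1)
  x2 <- rnorm(999, 0, 1)
  y  <- x1 + x2                               # no interaction built in
  fit <- summary(lm(y ~ x1 + x2 + x1*x2))
  fit$coefficients["x1:x2", "Pr(>|t|)"]       # p-value of the x1:x2 term
})
mean(pvals < 0.05)   # fraction of runs flagged "significant" at the 5% level
```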

duplicated from [https://stats.stackexchange.com/questions/579271/when-independent-variables-covary-not-correlate-in-the-regression]

Best answer:

It is because floating-point arithmetic cannot resolve differences that small. For model 2 the true model is

$y = \frac{1}{999999}\,iv1 + \frac{1}{999999}\,iv2$.

Note the scale mismatch between the response and the predictors: y is on the order of 1, while iv1 and iv2 are on the order of $10^6$. The software correctly recovers the coefficients for iv1 and iv2, $1/999999 \approx 10^{-6}$.

But since y contains no error, the residuals are pure rounding error (on the order of $10^{-15}$, as the residual standard error shows), and the iv1:iv2 term ends up fitting this rounding noise. It picks up a coefficient around $10^{-28}$, so the interaction's contribution to the fit, coefficient $\times$ iv1 $\times$ iv2, is on the order of $10^{-16}$: very, very small, below what double precision can distinguish. Because that rounding noise is not well-behaved random error, the t-test sometimes flags a "significant" interaction when there really shouldn't be one. If you use less extreme scaling, say iv1 = 99999*x1 and iv2 = 99999*x2, you will get a significant interaction much less often.