Connection between a low $p$-value and low explained variation in multiple regression analysis

I've just started studying multiple linear regression and I'm stuck on creating a dataset for which a multiple regression model has low $p$-values (so the coefficients are judged to be nonzero) but also a low explained variation $R^2$. The following is what I've come up with so far.

To create a dataset with

  • low explained variation, it is necessary that the output variable is not in linear correspondence with the predictor variables.
  • very low $p$-values for the coefficients, it is necessary that the coefficients are nonzero. If the predictors are uncorrelated and all variables are standardized, the $i$th coefficient $a_i$ is equal to $\rho(X_i,Y)$ (the correlation between $X_i$ and $Y$). If $X_i$ and $Y$ are not connected by a linear relation, $\rho$ is almost zero, so this scenario has to be excluded. On the other hand, if I define $Y$ so as to have a linear relation to $X_i$ for every $i$, the condition on explained variation no longer holds. If the predictors are not independent, I know the formula for the coefficients, but it's not clear to me how to proceed.

There are 1 best solutions below


The function does not have to be nonlinear to get the results you specify. Here is one way to do it in R. The two key ingredients are a large sample size and a random error that is large relative to the prediction function (x1 - x2 in this case):

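# Generate a grid of predictors (n^2 = 2500 rows)
n = 50
x1 = rep(c(1:n)/n, each=n)
x2 = rep(c(1:n)/n, times=n)

# Generate response variable
# (A linear function of x1 and x2 plus normal noise with standard deviation 3.
# One independent draw per row is needed, so n^2 draws; with only n draws
# R would recycle the noise and make it a function of x2.)
y = x1 - x2 + rnorm(n^2, 0, 3)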

# Perform regression and show summary
results = lm(y ~ x1 + x2)
summary(results)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8591 -2.7944  0.5241  1.9112  6.4362 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7447     0.1823   4.086 4.53e-05 ***
x1            1.0000     0.2346   4.262 2.10e-05 ***
x2           -0.6317     0.2346  -2.693  0.00714 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.386 on 2497 degrees of freedom
Multiple R-squared:  0.01008,   Adjusted R-squared:  0.009284
F-statistic: 12.71 on 2 and 2497 DF,  p-value: 3.225e-06

The estimated $R^2$ is 0.01008, while the $p$-value of the overall $F$-test is 3.225e-06. Exact figures will differ from run to run because no random seed is set.
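
A low $R^2$ and a low $p$-value are not in tension. Here is a short sketch of why, using notation that is assumed here and not taken from the post above: $N$ observations and $k$ predictors plus an intercept. For an ordinary least-squares fit, the overall $F$ statistic depends only on $R^2$ and the degrees of freedom:

$$F = \frac{R^2/k}{(1-R^2)/(N-k-1)}$$

Plugging in the values from the summary output ($R^2 = 0.01008$, $k = 2$, $N-k-1 = 2497$) gives

$$F = \frac{0.01008/2}{0.98992/2497} \approx 12.71,$$

which agrees with the reported F-statistic. For a fixed small $R^2$, $F$ grows roughly in proportion to $N$, so a large enough sample makes even a weak linear signal statistically significant.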