I've just started studying multiple linear regression and I'm stuck at creating a dataset for which a multiple regression model would have a low $p$-value (the coefficients are nonzero) and also a low explained variation $R^2$. The following is what I've come up with.
To create a dataset with
- low explained variation, it is necessary that the output variable is not in a linear relationship with the predictor variables.
- a very low $p$-value for the coefficients, it is necessary that the coefficients are nonzero. If the predictors are independent and all variables are standardized, the $i$th coefficient $a_i$ is equal to $\rho(X_i,Y)$ (the correlation between $X_i$ and $Y$). If $X_i$ and $Y$ are not connected by a linear relation, the correlation $\rho$ is almost zero, so this scenario has to be excluded. On the other hand, if I define $Y$ so as to have a linear relation to $X_i$ for every $i$, the condition on explained variation no longer holds. However, if the predictors are not independent, although I know the formula for the coefficients, it's not very clear to me how to proceed.
The function does not have to be nonlinear to get the results you specify. Here's one way using R, of which the two most important parts are a large sample size and a large random error relative to the prediction function (`x1 - x2` in this case).
The $R^2$ value is estimated to be 0.01008 and the model $p$-value is 3.225e-06.
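A minimal sketch of such a simulation follows. It is not necessarily the code behind the numbers quoted above: the seed, the sample size `n = 5000`, and the error standard deviation `sd = 10` are assumptions chosen for illustration, so the output will differ from those values.

```r
# Assumed settings (not from the original answer): seed, n, and error sd.
set.seed(1)
n  <- 5000                           # large sample size
x1 <- rnorm(n)                       # two independent standard normal predictors
x2 <- rnorm(n)

# Linear prediction function x1 - x2, swamped by a large random error.
y <- x1 - x2 + rnorm(n, sd = 10)

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)

r2 <- s$r.squared                    # explained variation
f  <- s$fstatistic                   # named vector: value, numdf, dendf

# Overall model p-value from the F statistic.
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(R2 = r2, p_value = p_model))
```

The reason this works is visible in the overall $F$ statistic, $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$ with $k$ predictors: for a fixed small $R^2$, $F$ grows roughly linearly with $n$, so a large enough sample makes the $p$-value as small as you like while the noise keeps $R^2$ low.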