Squaring Predictor and Response Variable in Multiple Regression Model


If I square both the response variable $(Y)$ and a predictor variable ($x_k$) in a multiple regression model ($y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i$), should I include the original predictor $(x_k)$ alongside the squared predictor ($x_k^2$)? Or would doing so create a multicollinearity problem?



$X^2$ is not a linear transformation of $X$; however, $X$ and $X^2$ are in general not independent (uncorrelated), so high multicollinearity is a possible issue. Whether to keep both terms is not so much a statistical question. IMHO, keeping both is important, since the very inclusion of $X^2$ often stems from the assumption of a quadratic relationship whose minimum or maximum is not at $x = 0$, i.e., $$ \mathbb{E}[Y|X] = \beta_0 + \beta_1x + \beta_2 x^2, $$ so that $$ \frac{\partial}{\partial x} \mathbb{E}[Y|X]= \beta_1 +2\beta_2x=0, $$ and you can estimate the location of this extremum by $$ \hat{x}_{ex} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}. $$
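A minimal sketch of this extremum estimate in Python (using NumPy; the true curve, coefficients, and noise level here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 6.0, 50)
# simulated data: true curve 1 - 4x + x^2 has its minimum at x = 2
y = 1 - 4 * x + x**2 + rng.normal(scale=0.3, size=x.size)

# design matrix with intercept, x, and x^2 columns
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# estimated extremum location: -beta1 / (2 * beta2)
x_ex = -beta[1] / (2 * beta[2])
print(x_ex)  # should be close to the true minimum at x = 2
```

Note that dropping the linear term here would force the fitted parabola's extremum to $x = 0$, which is exactly why both $X$ and $X^2$ are kept.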


Here are Minitab plots for four regressions: $x$-values are 1 to 10; corresponding $Y$-values are

10.631, 18.479, 31.486, 48.778, 69.724, 94.167, 123.812, 157.434, 194.324, 235.911

Following my Comment, the point of this demonstration is that, when possible, it is best to try to discover the true relationship between $Y$'s and $x$'s rather than treating transformations as 'tricks' to get small residuals. This is inherently a classical, fully mathematical, small-data approach.

[Modern big-data approaches favor finding (possibly unknown) relationships that produce optimal predictions. Very roughly speaking, such approaches are validated by testing each method on 'holdouts' (not used in finding the relationships) to see how well it works, and on using other cross-validation methods, much discussed on our sister stat (cross-validated) site.]

(1) Regress $Y$ on $x$.

(2) Regress $Y^2$ on $x^2$.

(3) Regress $Y^2$ on $x$ and $x^2$.

(4) Regress $Y$ on $x$ and $x^2$; this one matches (reasonably well) the model used to make the $Y$'s, and so is the best of the four.

My model was $Y_i = 5 + 3x_i + 2x_i^2 + e_i,$ with $e_i \stackrel{iid}{\sim} \mathsf{Norm}(0, \sigma = 0.5).$
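The winning regression (4) can be reproduced from the ten data values above (a sketch using NumPy rather than Minitab; the least-squares fit should recover coefficients near the true $5, 3, 2$ and an R-Sq very close to 1):

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = np.array([10.631, 18.479, 31.486, 48.778, 69.724,
              94.167, 123.812, 157.434, 194.324, 235.911])

# model (4): regress Y on x and x^2 (with an intercept)
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared of the fit
fitted = X @ beta
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta, r2)  # coefficients near (5, 3, 2); R-squared near 1
```

With only ten points and $\sigma = 0.5$, the intercept and linear coefficient will deviate somewhat from the true values, but the quadratic term and R-Sq are recovered almost exactly.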

Even when exploring simple models such as these four, there are several useful criteria for choosing the best one. The only objective criterion for "best" available from the Minitab output is to choose the model with the largest R-Sq(adj).

In this simple example, the variability in the 'error term' $e$ is quite small relative to the values of $Y$, so it is possible to explain essentially all of the variability in $Y$-values in terms of regression on $x$ and $x^2.$ You should not expect such 'perfection' in your own situation.