Squaring Predictor and Response Variable in Multiple Regression Model

If I square both the response variable ($Y$) and a predictor variable ($x_k$) in a multiple regression model ($y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i$), should I include the original predictor ($x_k$) along with the squared predictor ($x_k^2$)? Or does including both create a problem due to multicollinearity?
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail) · 365 views · 2 answers below
Here are Minitab plots for four regressions: $x$-values are 1 to 10; corresponding $Y$-values are
10.631, 18.479, 31.486, 48.778, 69.724, 94.167, 123.812, 157.434, 194.324, 235.911
Following my Comment, the point of this demonstration is that, when possible, it is best to try to discover the true relationship between $Y$'s and $x$'s rather than treating transformations as 'tricks' to get small residuals. This is inherently a classical, fully mathematical, small-data approach.
[Modern big-data approaches favor finding (possibly unknown) relationships that produce optimal predictions. Very roughly speaking, such approaches are validated by testing each method on 'holdouts' (data not used in fitting the relationships) to see how well it predicts, and by using other cross-validation methods, much discussed on our sister statistics site, Cross Validated.]
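As a toy illustration of the holdout idea mentioned above (this is a hypothetical sketch, not from the answer itself: it assumes NumPy and uses simulated data), one can fit two candidate models on a training subset and compare their prediction errors on points held out of the fit:

```python
import numpy as np

# Simulate data from a quadratic model (same shape as the one used
# later in this answer): Y = 5 + 3x + 2x^2 + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 5 + 3 * x + 2 * x**2 + rng.normal(0, 0.5, size=x.size)

# Split into a training set and a holdout set.
train, hold = np.arange(70), np.arange(70, 100)

def holdout_mse(deg):
    # Fit a polynomial of the given degree on the training data only,
    # then measure squared prediction error on the held-out points.
    coef = np.polyfit(x[train], y[train], deg)
    pred = np.polyval(coef, x[hold])
    return np.mean((y[hold] - pred)**2)

mse_linear = holdout_mse(1)
mse_quad = holdout_mse(2)
print(mse_linear, mse_quad)  # the quadratic should predict far better
```

Because the true relationship is quadratic, the degree-2 model should have a much smaller holdout error than the straight-line fit; that gap, rather than in-sample residuals, is what the big-data approach uses to choose between models.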
(3) Regress $Y^2$ on $x$ and $x^2:$

(4) Regress $Y$ on $x$ and $x^2;$ this one matches (reasonably well) the model used to make the $Y$'s, and so is the best of the four:

My model was $Y_i = 5 + 3x_i + 2x_i^2 + e_i,$ with $e_i \stackrel{iid}{\sim} \mathsf{Norm}(0, \sigma=.5).$
Even when exploring simple models such as these four, there are several useful criteria for choosing the best one. The only objective criterion for "best" that is visible in these four figures is to choose the model with the largest R-Sq(adj).
In this simple example, the variability in the 'error term' $e$ is quite small relative to the values of $Y$, so it is possible to explain essentially all of the variability in $Y$-values in terms of regression on $x$ and $x^2.$ You should not expect such 'perfection' in your own situation.
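For readers without Minitab, here is a minimal sketch, assuming NumPy, that reproduces regression (4) by least squares on the ten data values listed above and computes R-Sq(adj) by hand:

```python
import numpy as np

# The data from the demonstration above.
x = np.arange(1, 11, dtype=float)
y = np.array([10.631, 18.479, 31.486, 48.778, 69.724,
              94.167, 123.812, 157.434, 194.324, 235.911])

# Design matrix with intercept, x, and x^2 (model (4)).
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared and adjusted R-squared.
resid = y - X @ beta
n, p = X.shape
ss_res = resid @ resid
ss_tot = (y - y.mean()) @ (y - y.mean())
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)

print(beta)     # should be close to the true values (5, 3, 2)
print(r2_adj)
```

The fitted coefficients should land close to the true values $(5, 3, 2)$, and the adjusted $R^2$ comes out essentially 1, matching the 'perfection' noted above.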


$X^2$ is not a linear transformation of $X$; however, $X$ and $X^2$ are in general not independent (uncorrelated) either, so high multicollinearity is a possible issue. As for whether to keep both terms, that is not so much a statistical question. IMHO, keeping both is important, since often the very inclusion of $X^2$ stems from the assumption of a quadratic relationship whose minimum or maximum is not at $x = 0$, i.e., if $$ \mathbb{E}[Y|X] = \beta_0 + \beta_1x + \beta_2 x^2, $$ then setting $$ \frac{\partial}{\partial x} \mathbb{E}[Y|X]= \beta_1 +2\beta_2x=0 $$ shows that you can estimate the location of this extremum by $$ \hat{x}_{ex} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}. $$
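To make the last point concrete, here is a small hypothetical sketch (assuming NumPy; the data are simulated, not from the question) that fits a quadratic with both the linear and squared terms and recovers the extremum location via $-\hat{\beta}_1/(2\hat{\beta}_2)$:

```python
import numpy as np

# Simulate from a quadratic with an interior maximum at x = 3:
# E[Y|X] = 1 + 12x - 2x^2, so -beta_1 / (2*beta_2) = -12 / (2*(-2)) = 3.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 50)
y = 1 + 12 * x - 2 * x**2 + rng.normal(0, 0.5, size=x.size)

# Fit E[Y|X] = b0 + b1*x + b2*x^2 by least squares,
# keeping BOTH the linear and the squared term.
X = np.column_stack([np.ones_like(x), x, x**2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Estimated location of the extremum.
x_ex = -b1 / (2 * b2)
print(x_ex)  # should be near the true vertex at x = 3
```

Dropping the linear term would force the fitted parabola's vertex to $x = 0$, which is exactly why both terms belong in the model when the extremum location is unknown.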