Change in R-square as a result of linearly combining independent variables in linear regression

Can one improve the R-square in a linear regression by linearly combining some of the independent variables?

My intuition is that the fit gets (weakly) worse because the result is a more constrained regression. Below is a specific example.

(1) Regress $y$ on $x_1$ and $x_2$

(2) Regress $y$ on $x$, where $x=x_1+x_2$

Can $R^2$ from the second regression be better than that from the first regression?

Thank you.

Multiple linear regression is equivalent to projecting $Y$ onto the linear space spanned by the regressors (the columns of the design matrix $X$). If you append a linear combination of existing regressors to $X$, you get the same linear space, hence the same projection, hence the same $R^2$ (see https://stats.stackexchange.com/questions/123651/geometric-interpretation-of-multiple-correlation-coefficient-r-and-coefficient). If instead you replace $X$ with $XA$ for some matrix $A$, the outcome depends on the rank of $A$ as well as on any preexisting linear relations between the columns of $X$: the new span can be a strict subspace of the old one, in which case $R^2$ can only decrease. Your example (2) is exactly this situation, so its $R^2$ cannot exceed that of (1).
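As a quick numerical illustration of the argument above (a sketch in NumPy with made-up data; the helper `r_squared` is not from any particular library), we can compare the $R^2$ of the original regression, of the regression on $x = x_1 + x_2$ alone, and of a regression where the redundant column $x_1 + x_2$ is appended:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

def r_squared(y, X):
    # Least-squares projection of y onto span(X); R^2 = 1 - RSS/TSS.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / tss

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])           # regression (1)
X_sum  = np.column_stack([ones, x1 + x2])          # regression (2): x = x1 + x2
X_aug  = np.column_stack([ones, x1, x2, x1 + x2])  # (1) plus a redundant column

r_full, r_sum, r_aug = [r_squared(y, X) for X in (X_full, X_sum, X_aug)]
print(r_full, r_sum, r_aug)
# span(X_aug) == span(X_full), so r_aug equals r_full;
# span(X_sum) is a strict subspace, so r_sum <= r_full.
```

Note that `np.linalg.lstsq` still returns a (minimum-norm) solution for the rank-deficient `X_aug`, which is why the fitted values, and hence $R^2$, are unaffected by the redundant column.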

Note also that while the projection remains the same as long as $\mathrm{span}(X)$ doesn't change, the situation is different for the coefficients: if there are exact linear relationships between the columns of $X$, then $X$ doesn't have full column rank, $X^TX$ is not invertible, and the model is not identifiable. You can only estimate the coefficients up to one or more free parameters (related: https://stats.stackexchange.com/questions/257778/sum-to-zero-constraint-in-one-way-anova). This is also why nearly linear relations between columns are a problem when estimating a linear model: they produce large variances for the coefficient estimates, since these are the diagonal entries of $\sigma^2(X^TX)^{-1}$ and $X^TX$ is nearly singular.
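Both points can be checked directly (again a NumPy sketch with simulated data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact linear relation: the column x1 + x2 makes X rank-deficient,
# so X^T X is singular and the coefficients are not identifiable.
X = np.column_stack([np.ones(n), x1, x2, x1 + x2])
XtX = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(XtX))  # 3, not 4

# Near-collinearity: as x2 approaches a linear function of x1, the diagonal
# of (X^T X)^{-1}, which scales Var(beta_hat) = sigma^2 (X^T X)^{-1},
# blows up.
scales = []
for eps in (1.0, 0.1, 0.01):
    x2_near = x1 + eps * rng.normal(size=n)
    Xn = np.column_stack([np.ones(n), x1, x2_near])
    scales.append(np.diag(np.linalg.inv(Xn.T @ Xn)).max())
print(scales)  # grows as eps shrinks
```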