Comparing polynomial and linear regression


I am tasked with finding a good model for some provided data. A quadratic model has been given and we are asked to compare the quadratic model with a simple linear regression. When comparing two linear regression models, I would normally check the residual plots (i.e. standardised residuals vs independent variable, standardised residuals vs fitted value) and normal QQ plots as well as checking the $R^2$ value and seeing which is higher. Would this methodology work for deciding whether the quadratic or simple linear model is better?


There is 1 best solution below


I am interpreting 'quadratic model' to mean a linear regression with a quadratic term, i.e. $Y=\alpha + \beta X + \gamma X^2 + \epsilon$.

On the data used for fitting, a quadratic model will never fit worse than a simple linear model, because the linear model is just the special case $\gamma = 0$. However, the extra term might result in overfitting, which is not desirable. So in principle, we are comparing a more 'parsimonious' simple linear regression model with a more flexible quadratic model that may overfit.

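The above idea of overfitting is related to your question regarding $R^2$. $R^2$ can never be lower in the quadratic model than in the simple linear regression, and it will almost always be strictly higher (the exception is when the estimated quadratic coefficient is exactly zero, for instance when a straight line already fits the data perfectly). So a higher $R^2$ on its own does not show that the quadratic model is better. If you wish to account for this, a metric such as adjusted $R^2$ (which penalizes the inclusion of extra variables that do not lead to large incremental increases in $R^2$) might be more appropriate.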
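To make the $R^2$ point concrete, here is a minimal sketch using NumPy. The data are synthetic (truly linear plus noise) and the helper name `r2_and_adjusted` is my own, not something from your assignment. It uses the usual formula, adjusted $R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $p$ is the number of predictors excluding the intercept.

```python
import numpy as np

def r2_and_adjusted(y, fitted, n_predictors):
    """Return (R^2, adjusted R^2) for a least-squares fit with an intercept."""
    n = len(y)
    ss_res = np.sum((y - fitted) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj

# Synthetic data (an assumption for illustration): a truly linear relationship.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

results = {}
for name, degree in [("linear", 1), ("quadratic", 2)]:
    coeffs = np.polyfit(x, y, deg=degree)     # ordinary least squares
    fitted = np.polyval(coeffs, x)
    results[name] = r2_and_adjusted(y, fitted, n_predictors=degree)
    print(f"{name:9s}  R^2 = {results[name][0]:.4f}  adjusted R^2 = {results[name][1]:.4f}")
```

The plain $R^2$ of the quadratic fit cannot come out below that of the linear fit, even though the data here were generated from a straight line. The adjusted values are the ones worth comparing.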

An example where a quadratic model is arguably better than a linear model is the regression of annual wage income on age. The quadratic model is better because income actually peaks around middle age (55-65 years old) and then drops afterwards, so the relationship is not truly linear.

  • A quadratic model allows for a maximum to be attained around 55-65 years old. [A key difference between the two models is that a quadratic model allows for a varying slope, whereas the linear model has a fixed slope.]
  • In contrast, the linear model has to fit a single slope to both the younger ages (where wage rises with age) and the older ages (where wage falls with age). These two effects partially cancel each other out and lead to a flatter slope overall, which is not representative of young age and middle age (where the slope should be positive and relatively large) or old age (where the slope should be negative).
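The two bullet points above can be illustrated with a small sketch. The wage curve below is entirely made up (noise-free, peaking at age 60, with arbitrary units); it is only meant to show how the single slope of a linear fit compares with the varying slope of a quadratic fit.

```python
import numpy as np

# Hypothetical, noise-free data: wage rises with age, peaks at 60, then falls.
age = np.arange(20, 81, dtype=float)
wage = 50.0 - 0.02 * (age - 60.0) ** 2

# Simple linear fit: one slope for all ages.
linear_slope, linear_intercept = np.polyfit(age, wage, deg=1)

# Quadratic fit: wage ~ c0 + c1 * age + c2 * age^2 (polyfit returns highest power first).
c2, c1, c0 = np.polyfit(age, wage, deg=2)

def quadratic_slope(a):
    """Slope of the fitted quadratic at age a (derivative of the fitted curve)."""
    return c1 + 2.0 * c2 * a

peak_age = -c1 / (2.0 * c2)          # where the fitted quadratic is maximal
slope_young = quadratic_slope(25.0)  # steep and positive
slope_old = quadratic_slope(75.0)    # negative

print(f"linear slope:            {linear_slope:.3f}")
print(f"quadratic slope at 25:   {slope_young:.3f}")
print(f"quadratic slope at 75:   {slope_old:.3f}")
print(f"quadratic peak at age:   {peak_age:.1f}")
```

On this toy data the linear slope is positive but noticeably smaller than the quadratic slope at young ages, and it has the wrong sign for old ages.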

[Edit in response to comment]

Residual plots can indeed be useful for getting intuition about whether the modeling assumptions hold. For example, a great way to visualize the issue described in the second bullet point above is the residual plot below:

[Figure: residual plot indicating that a nonlinear fit is probably best.]

I took the screenshot above from this post, which has a very detailed discussion of how to use residual plots to gauge whether modeling assumptions hold. If your residual plots are like the leftmost screenshot in that post (the one labeled 'no problem'), then a linear fit is probably best. If your residual plots are like the rightmost screenshot in that post (the one labeled 'nonlinear'), then a quadratic fit is probably best. [The heteroskedasticity case shown in the middle screenshot of that post is not very relevant to the question you are asking.]
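If you want a quick numerical stand-in for eyeballing the residual plot, one rough option is to average the residuals within a few bins of the independent variable. This is only a sketch on synthetic curved data; the helper `binned_residual_means` and the choice of three bins are my own assumptions, not a standard diagnostic.

```python
import numpy as np

def binned_residual_means(x, y, degree, n_bins=3):
    """Fit a polynomial of the given degree and return the mean residual
    in each of n_bins consecutive groups of observations, ordered by x."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    order = np.argsort(x)
    return [float(chunk.mean()) for chunk in np.array_split(residuals[order], n_bins)]

# Synthetic data (an assumption for illustration): a concave relationship plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 60)
y = 1.0 + 2.0 * x - 0.2 * x ** 2 + rng.normal(scale=0.3, size=x.size)

linear_means = binned_residual_means(x, y, degree=1)
quadratic_means = binned_residual_means(x, y, degree=2)

print("linear fit,    mean residual per bin:", np.round(linear_means, 2))
print("quadratic fit, mean residual per bin:", np.round(quadratic_means, 2))
```

For the linear fit, the bin means follow a negative, positive, negative pattern, which is the curved shape of the 'nonlinear' residual plot. For the quadratic fit they stay close to zero, as in the 'no problem' case.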