Interpreting ANOVA on regression function to determine model appropriateness


I am working on a data set with two variables: age (x) and wage (y). I fitted a series of polynomial regression models of increasing degree to the data, and I am trying to identify which model is the best (and simplest) for predicting wage from age. I have learned that one way to do this is to run an ANOVA on the models and interpret the F statistic and p-value reported for each one. After running the ANOVA, my intuition is to select model 2, and perhaps model 3, as the models that meet my selection criteria. However, one of my reference sources suggests that model 4 would be the most appropriate. Here is the output from the ANOVA:

  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
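
As a way of reading the F column, here is a minimal sketch that recomputes it from the RSS and Res.Df columns. It is plain Python arithmetic rather than the original R call, and it rests on two assumptions: that the five rows are nested fits of increasing polynomial degree, and that each row's F statistic compares that model with the row directly above it, using the residual mean square of the largest model (row 5) as the denominator. The names `sequential_f`, `rss`, `res_df` and `scale` are made up for illustration.

```python
# RSS and residual degrees of freedom copied from the ANOVA table above.
rss = [5022216, 4793430, 4777674, 4771604, 4770322]
res_df = [2998, 2997, 2996, 2995, 2994]

# Assumption: the denominator is the residual mean square of the
# largest model in the table (row 5).
scale = rss[-1] / res_df[-1]


def sequential_f(rss, res_df, scale):
    """Return one F statistic per row after the first.

    Entry k compares the model in row k+1 with the model in row k:
    (drop in RSS / number of added terms) / scale.
    """
    stats = []
    for k in range(1, len(rss)):
        drop_in_rss = rss[k - 1] - rss[k]        # the "Sum of Sq" column
        extra_terms = res_df[k - 1] - res_df[k]  # the "Df" column
        stats.append((drop_in_rss / extra_terms) / scale)
    return stats


f_stats = sequential_f(rss, res_df, scale)
for k, f in enumerate(f_stats, start=2):
    print(f"model {k} vs model {k - 1}: F = {f:.4f}")
```

Under these assumptions the recomputed values match the F column up to the rounding of the printed RSS figures, which suggests that each row measures the improvement from adding one more term rather than scoring that model on its own.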

I don't understand why model 4 would be the best model in this instance, when there are models with higher F statistics and lower p-values to choose from. Does this have anything to do with model variability? I'm clearly missing something in my understanding and am hoping someone from the community can help. I realize this may come across as a machine learning problem, and I apologize if this is the wrong section to post in.

Many thanks for your help, all.