How do you determine which variables to remove from a regression model


I apologise for how vague this question may appear but I am not finding any resources online to help with this issue.

I have a data frame loaded into R and split into two separate data frames: training and testing.

My data is about diabetes and has 8 variables, including "Glucose", which is the response variable for my regression model.

I have fitted an lm of Glucose on the other 7 variables, but I am now struggling to decide which ones should be removed.

This is the current output of my model:


```
Call:
lm(formula = Glucose ~ Pregnancies + BloodPressure + SkinThickness + 
    Insulin + BMI + DiabetesPedigreeFunction + Age, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-68.652 -16.047  -3.082  13.346  75.723 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              61.14240    9.67267   6.321 1.08e-09 ***
Pregnancies               0.04819    0.63083   0.076  0.93917
BloodPressure             0.14300    0.12764   1.120  0.26356
SkinThickness             0.10747    0.18138   0.592  0.55403
Insulin                   0.12793    0.01291   9.911  < 2e-16 ***
BMI                       0.11406    0.28488   0.400  0.68921
DiabetesPedigreeFunction  6.95952    4.16151   1.672  0.09562 .
Age                       0.63202    0.20269   3.118  0.00202 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.78 on 268 degrees of freedom
Multiple R-squared:  0.4036,    Adjusted R-squared:  0.3881 
F-statistic: 25.91 on 7 and 268 DF,  p-value: < 2.2e-16
```

There is 1 best solution below


The question of model selection, i.e. picking which features to include in a model, doesn't have a single answer! There are a few families of techniques worth considering:

The first type of model selection technique is a stepwise technique. These include:

  1. Backward Elimination

In backward elimination, you run a regression with the full model (as you did) and then remove the variable with the largest $p$-value (or the variable chosen by some other criterion, e.g. AIC, which we will talk about later). You refit and repeat until you are left only with features whose $p$-values lie below a predesignated threshold. This is a very common method of model selection in Economics.

  2. Forward Selection

In forward selection you start with a null model (intercept only) and then add whichever variable does best in terms of a metric you choose. This could be $p$-value or AIC or something else. You stop adding once the best remaining variable would fail to improve the metric by some predetermined threshold.

  3. Stepwise Selection

This procedure is best done by a computer. It goes back and forth through the model space, adding and removing variables at each step according to a criterion you specify, so a variable added early can be dropped again later. This can be implemented in many ways.
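As a rough illustration of backward elimination by $p$-value, here is a minimal sketch. The function name `backward_by_p` and the threshold `alpha` are made up for this example, the built-in `mtcars` data stands in for your `training` data frame, and it assumes every predictor is numeric (so each coefficient name matches a column name).

```r
# Backward elimination by p-value: a sketch, not a definitive implementation.
# Assumes all predictors are numeric, so coefficient names equal column names.
backward_by_p <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    terms_in <- rownames(coefs)[-1]          # every term except the intercept
    p <- coefs[terms_in, "Pr(>|t|)"]
    worst <- which.max(p)                    # least significant term
    if (p[worst] <= alpha) return(fit)       # everything left is significant
    predictors <- setdiff(predictors, terms_in[worst])
  }
  # Nothing survived: fall back to the intercept-only model.
  lm(reformulate("1", response = response), data = data)
}

# mtcars is only a stand-in for the `training` data in the question.
fit_p <- backward_by_p(mtcars, "mpg", alpha = 0.05)
print(formula(fit_p))
```

With your data the call would look like `backward_by_p(training, "Glucose")`. Each pass removes exactly one variable and refits, because the $p$-values of the remaining variables change once a variable is dropped.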

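If you are using R, you can run all three easily with the command:

step(model, direction = "both")

where direction can be "forward", "backward" or "both". Note that step() compares models by AIC rather than by $p$-value, and that for forward selection you start from the null model and pass the largest model you are willing to consider through the scope argument.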
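A minimal sketch of the three directions, using the built-in `mtcars` data as a stand-in for your `training` data frame (that substitution, and the object names `full`, `null`, `back`, `fwd` and `both`, are assumptions for illustration):

```r
# mtcars stands in for the `training` data from the question (an assumption).
full <- lm(mpg ~ ., data = mtcars)   # model with every predictor
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# Backward: start from the full model and drop terms while AIC improves.
back <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model; `scope` gives the largest model allowed.
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Both: at each step a term may be added or dropped.
both <- step(null,
             scope = list(lower = formula(null), upper = formula(full)),
             direction = "both", trace = 0)

print(formula(back))
print(formula(fwd))
print(formula(both))
```

Setting `trace = 0` only silences the step-by-step log; leave it at the default to see which variable is added or dropped at each step. The three directions need not end at the same model, since each explores the model space along a different path.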

There is another type of model selection technique that simply picks the model that is best on a particular criterion. Two of the most common are AIC and BIC. Both reward a small RSS (a good fit) but add a penalty for the number of parameters you have, and BIC penalizes extra parameters more heavily for all but tiny samples. In R, a convenient route is the leaps package, which has the command

regsubsets(formula, data=)

which finds the best subset for each number of predictors. Calling summary() on the result shows which variables are in each subset and reports criteria such as BIC, Mallows' Cp and adjusted R-squared for each one. summary() does not report AIC directly, but for a linear model Cp plays much the same role.
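Here is a small sketch of that workflow. It assumes the leaps package is installed, again uses the built-in `mtcars` data in place of your `training` data, and the object names (`subsets`, `s`, `best_size`, `chosen`, `best_fit`) are just illustrative.

```r
library(leaps)  # assumed to be installed: install.packages("leaps")

# mtcars stands in for the `training` data from the question (an assumption).
# nvmax = 10 lets the search go up to all 10 predictors (the default cap is 8).
subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
s <- summary(subsets)

# s$which is a logical matrix: row k flags the variables in the best
# k-predictor model. s$bic, s$cp and s$adjr2 hold one value per row.
best_size <- which.min(s$bic)
chosen <- setdiff(names(which(s$which[best_size, ])), "(Intercept)")
print(chosen)

# Refit the chosen subset as an ordinary lm to inspect it further.
best_fit <- lm(reformulate(chosen, response = "mpg"), data = mtcars)
print(AIC(best_fit))
```

Swapping `which.min(s$bic)` for `which.min(s$cp)` or `which.max(s$adjr2)` selects by a different criterion, and the choices will not always agree.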

There are other methods of model selection (and of shrinking coefficients rather than dropping variables outright), but these are some of the most widely used.

Hope this helps!