I apologise for how vague this question may appear, but I am not finding any resources online to help with this issue.
I have a data frame loaded into R and split into two separate data frames: training and testing.
My data concerns diabetes and has 8 variables, including "Glucose", which is the variable I'm building the regression model for.
I have produced an lm of Glucose against all 7 other variables, but I am now struggling to select which variables need to be removed.
This is the current output of my model:

```
Call:
lm(formula = Glucose ~ Pregnancies + BloodPressure + SkinThickness +
Insulin + BMI + DiabetesPedigreeFunction + Age, data = training)
Residuals:
Min 1Q Median 3Q Max
-68.652 -16.047 -3.082 13.346 75.723
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               61.14240    9.67267   6.321 1.08e-09 ***
Pregnancies                0.04819    0.63083   0.076  0.93917    
BloodPressure              0.14300    0.12764   1.120  0.26356    
SkinThickness              0.10747    0.18138   0.592  0.55403    
Insulin                    0.12793    0.01291   9.911  < 2e-16 ***
BMI                        0.11406    0.28488   0.400  0.68921    
DiabetesPedigreeFunction   6.95952    4.16151   1.672  0.09562 .  
Age                        0.63202    0.20269   3.118  0.00202 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.78 on 268 degrees of freedom
Multiple R-squared: 0.4036, Adjusted R-squared: 0.3881
F-statistic: 25.91 on 7 and 268 DF, p-value: < 2.2e-16
```
The question of model selection, i.e. picking which features to include in a model, doesn't have a single answer! There are a few families of techniques worth considering.
The first family consists of stepwise techniques. These include:

- **Backward elimination:** you run a regression with the full model (as you did) and then remove the variable with the largest $p$-value (or the one picked out by some other criterion, e.g. AIC, which we will talk about later). You continue until you are left only with features whose $p$-values lie below a pre-designated threshold. This is a very common method of model selection in economics.
- **Forward selection:** you start with a null model and add whichever variable does best in terms of a metric you choose. This could be $p$-value or AIC or something else. You stop adding once the next variable you would add falls short of some pre-determined threshold.
- **Bidirectional (stepwise) selection:** this procedure is best done by a computer; it goes back and forth through the model space, adding and removing variables based on a criterion to be specified. This can be implemented in many ways.
If you are using R, you can implement all of these easily with the command

```
step(model, direction = "")
```

where `direction` can be `"forward"`, `"backward"`, or `"both"`.
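For instance, here is a minimal backward-elimination sketch. Since your diabetes data frame isn't available here, R's built-in `mtcars` data stands in; the variable names are illustrative only:

```r
# Backward elimination by AIC, using built-in mtcars data as a stand-in
# for the diabetes training data.
full <- lm(mpg ~ ., data = mtcars)        # full model, as in the question

# step() repeatedly drops the variable whose removal lowers AIC the most,
# stopping when no removal improves AIC; trace = 0 suppresses the log.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                          # the predictors that survive
summary(reduced)$adj.r.squared            # fit of the pruned model
```

With your data, you would simply replace `full` with your fitted `lm(Glucose ~ ..., data = training)` object.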
There is another family of model selection techniques based solely on a particular criterion. Two of the most common criteria are AIC and BIC. Both essentially reward a small RSS, but add a penalty for the number of parameters in the model. A good way to use them in R is the leaps package, which has the command
```
regsubsets(formula, data = )
```

which produces the best subset for each number of parameters. Look at the summary of this object to get a sense of what it outputs; you can then compute the value of the AIC or BIC for each subset given in that output.
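As a sketch of that workflow (again using `mtcars` as a stand-in, and assuming the leaps package is installed via `install.packages("leaps")`):

```r
# Best-subset search with the leaps package; mtcars stands in for the
# diabetes training data.
library(leaps)

fits <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
s <- summary(fits)

s$which            # which variables enter the best model of each size
which.min(s$bic)   # the subset size that minimises BIC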
There are a couple of other methods of model selection as well (including methods that shrink the parameters), but these are among the most widely used.
Hope this helps!