I am fitting a linear regression model to the diabetes toy dataset. Could someone explain what performance I should expect if I run the regression on only the 'statistically significant' features rather than on the full feature set?
My intuition is that using all of the features should be at least as good as using only the statistically significant ones, given the added information. This does not appear to be the case, though: I see a ~7% reduction in MSE with the statistically significant feature set.
For completeness, I evaluated statistical significance by running a t-test and computing a p-value for each feature, and found that the patient's age, sex, BMI and the s5 feature were significant.
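A minimal sketch of how this comparison could be set up is below. It assumes scikit-learn's `load_diabetes`, a 70/30 train/test split with a fixed seed, and coefficient t-tests from an ordinary least squares fit on the training split; none of those details are stated above. The helper names `ols_p_values` and `fit_and_score` are illustrative only, and the "significant" subset is simply the one reported in this question, not something the code derives.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

data = load_diabetes()
X, y, names = data.data, data.target, list(data.feature_names)

# Assumption: 70/30 split with a fixed seed; the split behind the ~7% figure is unknown.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def ols_p_values(X, y):
    """Two-sided t-test p-values for each OLS coefficient (intercept excluded)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # unbiased noise variance estimate
    cov = sigma2 * np.linalg.pinv(A.T @ A)         # coefficient covariance matrix
    t_stats = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t_stats), dof)
    return p[1:]                                   # drop the intercept's p-value


def fit_and_score(cols):
    """Fit on the training split using the given columns; return (train MSE, test MSE)."""
    model = LinearRegression().fit(X_train[:, cols], y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train[:, cols]))
    test_mse = mean_squared_error(y_test, model.predict(X_test[:, cols]))
    return train_mse, test_mse


p_values = ols_p_values(X_train, y_train)
for name, p in zip(names, p_values):
    print(f"{name}: p = {p:.4f}")

# Subset taken from the question as reported, not from the p-values printed above.
reported = ["age", "sex", "bmi", "s5"]
subset_cols = [names.index(f) for f in reported]
all_cols = list(range(X.shape[1]))

full_train, full_test = fit_and_score(all_cols)
sub_train, sub_test = fit_and_score(subset_cols)
print(f"all features:    train MSE = {full_train:.1f}, test MSE = {full_test:.1f}")
print(f"reported subset: train MSE = {sub_train:.1f}, test MSE = {sub_test:.1f}")
```

Because the subset model is nested inside the full model, the full model's training MSE can never be higher than the subset's. Any improvement from dropping features can therefore only appear on held-out data, and both the size of that improvement and the set of features that clear a p-value threshold may change with the split.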
I would be keen to hear the community's thoughts here.