I just read on several blogs something of the form: variable importance computed via permutation will be biased if the variables are correlated. For instance, it is stated by https://blog.methodsconsultants.com/posts/be-aware-of-bias-in-rf-variable-importance-metrics/ that
"The mean decrease in impurity and permutation importance computed from random forest models spread importance across collinear variables. For example, if you duplicate a feature and re-evaluate importance, the duplicated feature pulls down the importance of the original, so they are close to equal in importance."
In the article by Strobl et al. (2008), https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307, it is argued instead that correlated variables will show too high a variable importance:
"We know that the original permutation importance overestimates the importance of correlated predictor variables."
Furthermore, it is described in https://scikit-learn.org/stable/modules/permutation_importance.html that
"When two features are correlated and one of the features is permuted, the model will still have access to the feature through its correlated feature. This will result in a lower importance value for both features, where they might actually be important."
The three quotes seem rather contradictory. The second quote states that correlated variables will show too high a variable importance, whereas the third states that the variable importance will be too low. The first quote agrees with the third.
Does anyone know which is true? Is variable importance overestimated or underestimated when variables are correlated?
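The dilution effect described in the first and third quotes is easy to check directly with scikit-learn's `permutation_importance`: fit a forest, duplicate the informative feature, refit, and compare. This is a minimal sketch on made-up synthetic data; the dataset and all parameters are my own choices, not anything from the quoted sources.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: the label depends only on column 0; columns 1-2 are pure noise.
rng = np.random.RandomState(0)
n = 1000
x0 = rng.normal(size=n)
X = np.column_stack([x0, rng.normal(size=(n, 2))])
y = (x0 > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
base = permutation_importance(rf, X, y, n_repeats=10,
                              random_state=0).importances_mean

# Add an exact copy of the informative column as a fourth feature.
X_dup = np.column_stack([X, x0])
rf_dup = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
dup = permutation_importance(rf_dup, X_dup, y, n_repeats=10,
                             random_state=0).importances_mean

# Permuting one copy leaves the model the other copy, so the importance
# that was concentrated on column 0 is now diluted across columns 0 and 3.
print(base.round(3), dup.round(3))
```

In this setup, both copies of the informative feature come out with a lower permutation importance than the single original had, which is exactly the behaviour the blog and scikit-learn quotes describe.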
The blog quote says essentially the same thing as the scikit-learn documentation, and based on my own experiments both are correct; however, I also found the second quote confusing at first. Let me share my experiments to make the point clear.
The t-test score is a distance-based feature-ranking approach. In the following figure it is computed for 186 features of a binary classification problem; the higher the t-score, the more informative the feature.
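For reference, a per-feature t-score ranking of this kind can be computed with a two-sample t-test per feature. This is a sketch, not my exact pipeline: I use Welch's test via `scipy.stats.ttest_ind`, and the data below is a made-up two-feature example just to show the mechanics.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_scores(X, y):
    """Absolute two-sample t statistic of each feature between the two classes."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.abs(t)

# Tiny check: feature 0 separates the classes, feature 1 does not.
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 2))
X[y == 1, 0] += 2.0  # shift class 1 on feature 0 only
scores = t_scores(X, y)
print(scores.round(2))
```

The shifted feature gets a much larger t-score than the uninformative one, which is the ranking criterion used in the figure.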
Looking at it, we can clearly see that the best features lie around index 45 and its neighbours, while the least informative features lie in the range 90 to 100. Now let's look at the correlation between the features in the following figure.
The following figure also shows the class-wise distribution for this binary problem.
Looking at the correlation figure, it is obvious that the features in the range 90 to 100 have the lowest correlation, while the other, highly informative ranges of features are strongly correlated. Now let's look at the random forest feature importance, computed with scikit-learn's permutation importance, in the following figure.
Now all the features that were informative are downgraded because of the correlation among them, while the features that were uninformative but uncorrelated are identified as the more important ones. We can therefore conclude that random-forest permutation importance is not a suitable feature selection approach for datasets with highly correlated features.
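The effect in that last figure can be reproduced on synthetic data: a group of nearly identical strong features ends up ranked below a single independent but weaker feature. This is only a sketch of the phenomenon; the data-generating process and every parameter below are my own assumptions, not the dataset from my experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(42)
n = 2000
signal = rng.normal(size=n)                               # strong latent signal
group = signal[:, None] + 0.01 * rng.normal(size=(n, 5))  # 5 near-copies of it
weak = rng.normal(size=n)                                 # independent, weaker feature
y = (signal + 0.5 * weak > 0).astype(int)                 # label driven by both

X = np.column_stack([group, weak])
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10,
                             random_state=0).importances_mean

# Each of the five correlated copies scores near zero -- permuting one copy
# leaves four intact substitutes -- while the lone independent feature keeps
# its full importance and comes out on top.
print(imp.round(3))
```

So even though each member of the correlated group carries far more signal than the independent feature, permutation importance ranks the independent feature first, mirroring what happened with features 90 to 100 in my figures.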