We have been trying out anomaly detection using Isolation Forest. The model flags anomalies in our data, for which we provide about 25 features. We want to understand how much each feature contributes to an anomaly, so we decided to use Shapley values to estimate feature importance. Using the tree explainer, we plotted the Shapley values at the anomalous instance as well as a summary of the Shapley values.

Figure: Actual values with anomaly timestamp marked

The Shapley values were calculated using two methods:
- By passing the Isolation Forest model directly to the tree explainer.
  Figure: Shapley values with Isolation Forest
- By training an XGBoost regressor with the feature values as inputs and the anomaly scores as targets, and passing that regressor to the tree explainer (following an example found online).
  Figure: Shapley values with XGBoost regressor

As shown in the two plots, the same feature has the opposite (180-degree) influence under the two methods. I understand that these are two different models and that the results will differ, but to what extent?
I was unable to correlate the two results. I tried to observe the change in the features by plotting the percentage change at the point of the anomaly, but this doesn't correlate with what we see in the Shapley values either.

Figure: Percentage change plotted for the anomaly timestamp
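The percentage change at the anomaly timestamp could be computed along these lines. This is only an illustrative sketch: the feature names `cpu` and `mem`, the values, and the choice of the previous sample as the reference point are hypothetical, not taken from the actual data.

```python
def pct_change_at(values, idx):
    """Percentage change of values[idx] relative to the previous sample."""
    if idx < 1:
        raise ValueError("idx must have a previous sample to compare against")
    prev, curr = values[idx - 1], values[idx]
    if prev == 0:
        return float("nan")  # undefined when the previous value is zero
    return (curr - prev) / abs(prev) * 100.0

# Hypothetical feature series; index 3 plays the role of the anomaly timestamp.
series = {
    "cpu": [10.0, 10.0, 10.0, 15.0],
    "mem": [40.0, 40.0, 40.0, 30.0],
}
anomaly_idx = 3
changes = {name: pct_change_at(vals, anomaly_idx) for name, vals in series.items()}
print(changes)  # {'cpu': 50.0, 'mem': -25.0}
```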
I need guidance on how to interpret the Shapley values and relate them to the changes in the features. Currently I assume that a large Shapley value (negative or positive) should be explained by a change in the value of the corresponding feature, i.e. if the Shapley value is positive and the change in the feature's value is positive, then the feature's contribution to the anomaly score should also be positive.
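The assumption stated above can be turned into a simple per-feature check: does the sign of the Shapley value match the sign of the feature's change? The sketch below uses hypothetical numbers and feature names purely to show the comparison; it does not claim that the assumption itself holds.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_agreement(shap_row, change_row):
    """Map each feature to True when its Shapley value and its change share a sign."""
    return {name: sign(shap_row[name]) == sign(change_row[name]) for name in shap_row}

# Hypothetical values for a single anomalous instance.
shap_row = {"cpu": 0.8, "mem": 0.3}
change_row = {"cpu": 50.0, "mem": -25.0}

agreement = sign_agreement(shap_row, change_row)
print(agreement)  # {'cpu': True, 'mem': False}
```

Note that a Shapley value measures a feature's contribution relative to the explainer's expected (average) model output, not relative to the previous time step, so a disagreement in this check does not by itself indicate an error.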