I was given a formula for studying the stability of a multivariant variable.
The formula is introduced below:
$$ \sum_{i=1}^k (p_{i,2}-p_{i,1})\log \bigl(\frac{p_{i,2}}{p_{i,1}}\bigr) $$
Where:
- $p_{i,j}$ is the relative frequency of the observed value $i$ in the sample $j$.
- $j$ refers to the beginning of the relevant observation period ($j=1$) and the end of the relevant observation period ($j=2$) respectively.
- $k$ is the number of facility grades/pools or segments.
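For concreteness, here is a minimal Python sketch of how the formula above could be evaluated. The function name `psi` and the example frequencies are made up for illustration, and the sketch assumes every relative frequency is strictly positive (the logarithm is undefined otherwise).

```python
import numpy as np


def psi(p1, p2):
    """Evaluate sum_i (p_{i,2} - p_{i,1}) * log(p_{i,2} / p_{i,1}).

    p1, p2: relative frequencies of the k segments at the beginning (j=1)
    and at the end (j=2) of the observation period.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    if p1.shape != p2.shape:
        raise ValueError("both samples must use the same k segments")
    # Assumption of this sketch: no empty segments, so the log is defined.
    if np.any(p1 <= 0) or np.any(p2 <= 0):
        raise ValueError("all relative frequencies must be strictly positive")
    return float(np.sum((p2 - p1) * np.log(p2 / p1)))


# Hypothetical frequencies for k = 4 segments.
p_start = [0.25, 0.25, 0.25, 0.25]
p_end = [0.40, 0.30, 0.20, 0.10]
print(psi(p_start, p_end))
```

Every term of the sum is non-negative, because $p_{i,2}-p_{i,1}$ and $\log(p_{i,2}/p_{i,1})$ always have the same sign, so the result is zero only when the two sets of frequencies coincide.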
The ultimate goal of this analysis is to decide whether the value produced by the formula above indicates a reasonable amount of variation. For this, I need to establish a threshold. The problem is that I do not know where the formula comes from. I assume it is derived by assuming some kind of statistical distribution for the sample.
Thank you in advance, and my apologies for any mathematical inconsistencies you may find. Please do not hesitate to contact me if anything is unclear.
Your formula is the symmetrized Kullback-Leibler divergence, also known as the Jeffreys divergence. In the first link you'll find the assertion "In the Banking and Finance industries, this quantity is referred to as Population Stability Index, and is used to assess distributional shifts in model features through time." See this Cross Validated post for a discussion.
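To see why the identification holds, write $P_j=(p_{1,j},\dots,p_{k,j})$ for the two distributions and $D_{\mathrm{KL}}$ for the Kullback-Leibler divergence (this notation is mine, not from the question). Adding the two directed divergences gives exactly your sum:

$$ D_{\mathrm{KL}}(P_2\,\|\,P_1)+D_{\mathrm{KL}}(P_1\,\|\,P_2)=\sum_{i=1}^k p_{i,2}\log \bigl(\frac{p_{i,2}}{p_{i,1}}\bigr)+\sum_{i=1}^k p_{i,1}\log \bigl(\frac{p_{i,1}}{p_{i,2}}\bigr)=\sum_{i=1}^k (p_{i,2}-p_{i,1})\log \bigl(\frac{p_{i,2}}{p_{i,1}}\bigr) $$

The last step uses $\log(p_{i,1}/p_{i,2})=-\log(p_{i,2}/p_{i,1})$.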
(IMO there is very little justification for the thresholds $0.10$ and $0.25$ commonly cited when using the PSI. These thresholds seem to be folklore passed from one practitioner to the next, with no empirical/theoretical evidence to illustrate how a threshold relates to stability, and no regard for how the PSI behaves when you change $k$, the number of segments.)
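One way to get a feel for the dependence on $k$ is a small simulation. The sketch below is only an illustration under assumptions I chose myself: both samples are drawn from the same uniform distribution over $k$ segments (so the population is perfectly stable), each sample has 1000 observations, and the helper names `psi` and `null_psi_mean` are hypothetical.

```python
import numpy as np


def psi(p1, p2):
    """Symmetrized KL divergence between two frequency vectors."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return float(np.sum((p2 - p1) * np.log(p2 / p1)))


def null_psi_mean(k, n, reps, rng):
    """Average PSI over `reps` pairs of samples of size n, both drawn
    from the same uniform distribution over k segments."""
    probs = np.full(k, 1.0 / k)
    values = []
    for _ in range(reps):
        c1 = rng.multinomial(n, probs)
        c2 = rng.multinomial(n, probs)
        # Skip draws with an empty segment, where the log is undefined.
        if np.any(c1 == 0) or np.any(c2 == 0):
            continue
        values.append(psi(c1 / n, c2 / n))
    return float(np.mean(values))


rng = np.random.default_rng(0)
for k in (5, 10, 20):
    print(k, null_psi_mean(k, n=1000, reps=500, rng=rng))
```

In this setup the average PSI tends to grow with $k$ even though nothing has changed between the two samples, which is one reason a single fixed threshold is hard to interpret across different numbers of segments.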