Measuring the dispersion of a set of values that lie within a range


I want to know the best method for measuring the statistical dispersion of a set of values that lie within a known range.

Scenario:

My goal is to build an index. I have two methods that generate index values (index codes), and I want to find out which method creates the better index. An index is considered better if its values are spread more evenly, or more widely, across the index space.

For example, let $I_1=\{20, 40, 60, 80, 100\}$ and $I_2=\{54, 55, 56, 57, 59\}$ be two indexes. Then $I_1$ is better than $I_2$ because it is spread more widely across the index space (0 to 100).

The index space has a lower bound ($L$) and an upper bound ($H$). All index values lie between these bounds ($L \le i_j \le H$).

So I want a statistical measure that I can use to correctly identify the "more scattered" index.

Thank you in advance.


There are 1 best solutions below

On BEST ANSWER

Since you know the upper ($H$) and lower ($L$) bounds of your index range, the "ideal" index distribution would be discrete-uniform on $[L,H]$.

Unfortunately, the standard deviation, skewness, or kurtosis will only partially characterize what you are looking for. I have had to deal with the same issue myself: finding a maximally "even" distribution.
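A small sketch of why the standard deviation alone can mislead here. The two samples are hypothetical (they are not from the question) and use the population standard deviation from Python's `statistics` module:

```python
from statistics import pstdev

# Hypothetical samples on the index space 0..100 (illustrative only).
evenly_spread = [0, 20, 40, 60, 80, 100]
piled_at_ends = [0, 0, 0, 100, 100, 100]

sd_even = pstdev(evenly_spread)
sd_ends = pstdev(piled_at_ends)

# The sample piled up at the two bounds has the larger standard deviation,
# even though it covers the index space far less evenly.
print(f"evenly spread: {sd_even:.2f}")
print(f"piled at ends: {sd_ends:.2f}")
```

A larger standard deviation therefore rewards distance from the mean, not evenness, which is why a measure tied to the uniform distribution is used instead.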

I've developed the following measure of "uniformity" that I've found helpful. Try it out and see if it meets your needs. It's calculated as follows:

  1. Create the right-continuous empirical CDF (ECDF) of the indices $I_i$. Note that it will be a step function; keep this in mind for the steps that follow.
  2. Calculate the Euclidean distance ($d=\sqrt{a^2+b^2}$) between the top of one jump and the top of the next jump (the top of a step being the leftmost point of the ECDF at that level), and call this $\delta(k_j,k_{j+1})$. Let $\delta(I_i)=p_0+\sum\limits_{j=1}^{|I_i|-1} \delta(k_j,k_{j+1})$, where $p_0$ is the probability assigned to the leftmost point of the distribution.
  3. The equivalent distance for a perfectly uniform distribution would be $U(L,H)=\frac{1}{H-L+1}+(H-L)\sqrt{1+(H-L+1)^{-2}}$, since that is the length of a diagonal line from the top of the ECDF at $L$ to the top of the ECDF at $H$, plus the initial jump $p_0=\frac{1}{H-L+1}$.
  4. Calculate the uniformity index as $\Upsilon(I_i) = \frac{U(L,H)}{\delta(I_i)}$.
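The expression in step 3 can be checked directly. This is a short derivation, assuming integer index values so that the discrete uniform distribution on $[L,H]$ has $H-L+1$ equally likely points:

$$
\begin{aligned}
\text{endpoints: } & \left(L,\ \tfrac{1}{H-L+1}\right) \text{ and } (H,\ 1),\\
\text{diagonal} &= \sqrt{(H-L)^2 + \left(1-\tfrac{1}{H-L+1}\right)^2}
= \sqrt{(H-L)^2 + \tfrac{(H-L)^2}{(H-L+1)^2}}
= (H-L)\sqrt{1+(H-L+1)^{-2}},\\
U(L,H) &= p_0 + \text{diagonal} = \tfrac{1}{H-L+1} + (H-L)\sqrt{1+(H-L+1)^{-2}}.
\end{aligned}
$$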

From simple geometry, $0\leq \Upsilon(I_i) \leq 1$. You can rank your distributions from lowest value of $\Upsilon(I_i)$ (least uniform/even) to highest $\Upsilon(I_i)$ (most even).
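The steps above can be sketched in a few lines of Python. This is only one reading of the description, not a reference implementation: the function names (`ecdf_step_tops`, `path_length`, `uniform_length`, `uniformity`) and the sample data are made up for illustration, and integer index values are assumed.

```python
import math
from collections import Counter

def ecdf_step_tops(values):
    """Return the (x, F(x)) point at the top-left of each ECDF step."""
    counts = Counter(values)
    n = len(values)
    tops, cumulative = [], 0
    for x in sorted(counts):
        cumulative += counts[x]
        tops.append((x, cumulative / n))
    return tops

def path_length(values):
    """delta(I): first jump p0 plus distances between consecutive step tops."""
    tops = ecdf_step_tops(values)
    p0 = tops[0][1]
    return p0 + sum(math.dist(a, b) for a, b in zip(tops, tops[1:]))

def uniform_length(low, high):
    """U(L, H): the same length for the discrete uniform on low..high."""
    span = high - low
    return 1 / (span + 1) + span * math.sqrt(1 + (span + 1) ** -2)

def uniformity(values, low, high):
    """Upsilon(I) = U(L, H) / delta(I)."""
    return uniform_length(low, high) / path_length(values)

# Hypothetical samples on the index space 0..10.
full = list(range(11))          # every value used once
even = [0, 2, 4, 6, 8, 10]      # evenly spaced
clumped = [0, 1, 2, 3, 4, 10]   # bunched near the lower bound

for name, sample in [("full", full), ("even", even), ("clumped", clumped)]:
    print(name, round(uniformity(sample, 0, 10), 4))
```

In this sketch the fully populated sample scores 1 (up to rounding), and the evenly spaced sample ranks above the clumped one. All three samples run from $L$ to $H$ with distinct values. For a sample that stops short of either bound, such as the two in the question, the ratio as written can come out above 1, so it is worth checking that case before relying on the ranking.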