proof/explanation of Mann-Whitney U calculation

253 Views Asked by At

Brief example of Mann-Whitney U statistic. Imagine of having an series of As and Bs, e.g.:

A B A A B

This could represent the results of a race, where a member of Team A got the best result, a member of Team B second best, a member of Team A third best and so on. For this particular statistic, order is descendig, like so:

A B A A B
5 4 3 2 1 <-- worst result, 1 point
^
|
+----- best result, 5 points

U is then calculated as the number of wins in pairwise contests of members of Team A against Team B. In this case, A-who-got-1st place beats 2 Bs, and As in 3rd and 4th position beat a single B respectively; hence the U statistic is 4.

Now, the wiki page introduces a second, much quicker method:

$$ {\displaystyle U_{A}=R_{A}-{n_{A}(n_{A}+1) \over 2}\,\!} $$

Where $R_A$ is the sum of the ranks of $A$ (5+3+2=10) and $n_A$ is the size of $A$ (3). In this case:

$$ U_{A}= 10 - 6 = 4 $$

What is the way of deriving the quicker formula? I tried to pick up pencil and paper but failed, and each book on google books (plus the original paper) just writes it without any note/demonstration, as if it were obvious to even the most naive reader.

I suspect combinatorics is involved, but I cannot understand how.

1

There are 1 best solutions below

2
On BEST ANSWER

Let $A_1<\cdots< A_{n_A}$ be the 'scores' of the $n_A$ members of team $A$, ordered from worst (lowest rank) to best (highest rank), and $B_1<\cdots<B_{n_B}$ be the scores of the $n_B$ members of team $B$, also ordered from worst to best.

One way to compute the rank of a given member of team $A$ is to take that member's rank within team $A$, and add the number of members of team $B$ who scored worse. For example the worst team member has score $A_1$ and has rank $1$ within team $A$, and we boost this within-$A$ rank by adding the number of members in team $B$ who performed worse. If nobody in team $B$ did worse, then the rank stays at $1$. If one person in team $B$ did worse, then the rank increases to $2$, and so on.

Summing over each member of team $A$ gives the total sum of ranks of members of $A$: $$ R_A = \sum_{i=1}^{n_A} \left( i + \sum_{j=1}^{n_B} I(B_j < A_i) \right) $$ which simplifies to $$ R_A=\sum_{i=1}^{n_A} i + \sum_{i=1}^{n_A}\sum_{j=1}^{n_B} I(B_j<A_i) =\frac{n_A(n_A+1)}2 + U_A $$ since the sum $\displaystyle \sum_{i=1}^{n_A}\sum_{j=1}^{n_B} I(B_j<A_i)$ is precisely the number of pairwise contests between $A$ and $B$ in which the $A$ member beats the $B$ member.