Given two samples $X$ and $Y$, where $X$ consists of $X_1,\dots,X_n$, independent random variables with unknown distribution $F$, and $Y$ consists of $Y_1,\dots,Y_m$, independent random variables with unknown distribution $G$, find the expected value and variance of
$$R=\sum_{i=1}^{n+m}iA_i$$
(where $A_i$ is $1$ if the $i$th smallest value of the combined sample belongs to $X$, and is $0$ otherwise), assuming that $F=G$.
If $F=G$ then I believe that implies that we have $n+m$ independent and identically distributed random variables, and therefore $\mathbb{P}(A_i=1)=\frac{n}{n+m}$, making $\mathbb{E}[R]=\frac{n(n+m+1)}{2}$. The variance causes me more trouble because $R$ is a sum of random variables, and I am therefore not sure whether using $\mathbb{E}\left[\binom{R}{2}\right]$ to calculate $\mathbb{E}[R^2]$ would be a good idea, or whether I need to expand $R^2$ and then use linearity of expectation. I could also perhaps use $$\mathrm{Var}[R]=\sum_{i=1}^{n+m}\sum_{j=1}^{n+m}ij\,\mathrm{Cov}(A_i,A_j)$$
But I still do not know how to calculate $\mathbb{E}[A_iA_j]$, much less the covariance.
I am not so certain about either of these results and even if they were correct, I am not satisfied with the way I obtained them. Is there a more rigorous and systematic way of getting to these answers? What if I had had to compute the expected value and variance of $R$ when $F\ne G$?
Finding the Wilcoxon Sum of Ranks test expected value and variance

Asked by Bumbble Comm, 2026-03-27
Let's start with the expectation. You are indeed right that $\mathbb{E}[R] = \frac{n(n+m+1)}{2}$, and your approach to the problem seems quite rigorous (without resorting to measure-theoretic machinery); I would have solved it the same way.
For the sake of completeness, let's show another approach to finding the expectation here. Define $A_{ij}$ to be the indicator variable for "$i$th smallest value in the $X \cup Y$ sample is the $j$th smallest value in the $X$ sample", and notice $$ \mathbb{P}\{A_i = 1\} = \mathbb{P}\{\cup_{j=1}^n \{A_{ij} = 1\}\} = \sum_{j=1}^n \mathbb{P}\{A_{ij} = 1\}, $$ where the last equality comes from the events $\{A_{ij} = 1\}$, $j = 1, \dots, n$, being mutually exclusive for any fixed $i$. Note that $A_{ij}$ is constantly zero for $i < j$, but that shouldn't worry us much.
Now, our next step is to notice that $A_{ij}$ is tightly related to the negative hypergeometric distribution: letting $H_j$ be a negative hypergeometric random variable with $n$ "special balls", $m$ "ordinary balls" and $j$ "special balls to be selected" (so $H_j$ is the position of the $j$th special ball), it's fairly easy to see that $A_{ij} = 1 \iff H_j = i$. Given this, it's just a matter of a bit of arithmetic: $$ \mathbb{E}R = \mathbb{E}\sum_{i=1}^{n+m}iA_i = \sum_{i=1}^{n+m}i\mathbb{E}A_i = \sum_{i=1}^{n+m}i\mathbb{P}\{A_i = 1\}. $$ Expanding $\mathbb{P}\{A_i = 1\}$ and interchanging the order of summation yields $$ \mathbb{E}R = \sum_{i=1}^{n+m}i\sum_{j=1}^n \mathbb{P}\{A_{ij} = 1\} = \sum_{j=1}^n\sum_{i=1}^{n+m} i\mathbb{P}\{A_{ij} = 1\}. $$
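To make the correspondence concrete, here is a small Python sketch (the function name `neg_hypergeom_pmf` is mine; the pmf used is the standard "position of the $j$th special ball" parametrization, $\mathbb{P}\{H_j=i\}=\binom{i-1}{j-1}\binom{n+m-i}{n-j}/\binom{n+m}{n}$) checking that the exact mean of $H_j$ matches $j\frac{n+m+1}{n+1}$:

```python
from math import comb

def neg_hypergeom_pmf(i, j, n, m):
    # P(H_j = i): position i holds the j-th smallest X value, i.e. exactly
    # j-1 of the first i-1 positions and n-j of the last n+m-i positions
    # are occupied by X's. comb() returns 0 for i < j, matching A_ij = 0.
    N = n + m
    return comb(i - 1, j - 1) * comb(N - i, n - j) / comb(N, n)

n, m, j = 4, 6, 2
N = n + m
mean_Hj = sum(i * neg_hypergeom_pmf(i, j, n, m) for i in range(1, N + 1))
print(mean_Hj, j * (N + 1) / (n + 1))  # both ≈ 4.4
```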
Now we shall notice that, for a fixed $j$, $\sum_{i=1}^{n+m} i\mathbb{P}\{A_{ij} = 1\}$ is just $\mathbb{E}H_j = j\frac{n+m+1}{n+1}$, so, finally $$ \mathbb{E}R = \sum_{j=1}^n j\frac{n+m+1}{n+1} = \frac{n+m+1}{n+1} \sum_{j=1}^n j = \frac{n+m+1}{n+1} \frac{n(n+1)}{2} = \frac{n(n+m+1)}{2}, $$ as expected.
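As a sanity check on the final formula, a quick Monte Carlo simulation (a sketch assuming $F = G = \mathrm{Uniform}(0,1)$; any common continuous distribution would do, and the function name is mine):

```python
import random

def rank_sum_mean_mc(n, m, trials=20000, seed=0):
    # Draw n X's and m Y's from the same continuous distribution,
    # rank the pooled sample, and average R = sum of the X ranks.
    random.seed(seed)
    total = 0
    for _ in range(trials):
        pooled = sorted((random.random(), k < n) for k in range(n + m))
        total += sum(rank for rank, (_, is_x) in enumerate(pooled, start=1) if is_x)
    return total / trials

n, m = 5, 7
print(rank_sum_mean_mc(n, m), n * (n + m + 1) / 2)  # ≈ 32.5 in both cases
```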
Regarding the variance, I'm not sure you would be able to sanely use $\mathbb{E}[\binom{R}{2}]$, as $R$ is not just a sum of indicator variables, but rather a weighted sum. I also didn't have any insight that would allow me to do the same trick as above to reduce this to some known distribution, so let's just go the brute-force way (writing $\mathbb{D}$ for variance): $$ \mathbb{D}R = \mathbb{D}\sum_{i=1}^{n+m}iA_i = \sum_{i=1}^{n+m}i^2\mathbb{D}A_i + 2 \sum_{i < j} ij \,\text{Cov}(A_i, A_j). $$
Here, $\mathbb{D}A_i = \mathbb{E}A_i^2 - (\mathbb{E}A_i)^2$. Notice that $\mathbb{E}A_i^2 = \mathbb{E}A_i$, since $A_i$ is an indicator variable, so, $\mathbb{D}A_i = \frac{nm}{(n + m)^2}$, and, after doing some simplifications, we get $$ \sum_{i=1}^{n+m}i^2\mathbb{D}A_i = \frac{1}{6} \frac{nm (n + m + 1) (2n + 2m + 1)}{n+m}. $$
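This simplification is easy to spot-check numerically for a particular $n, m$ (plain Python, no computer algebra needed):

```python
n, m = 6, 9
N = n + m
# Left side: sum_i i^2 * D(A_i) with D(A_i) = nm/(n+m)^2.
first_term = sum(i * i * n * m / N**2 for i in range(1, N + 1))
# Right side: the claimed closed form (1/6) * nm(n+m+1)(2n+2m+1)/(n+m).
closed_form = n * m * (N + 1) * (2 * N + 1) / (6 * N)
print(first_term, closed_form)  # both ≈ 297.6
```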
To compute $\text{Cov}(A_i, A_j) = \mathbb{E}A_iA_j - \mathbb{E}A_i\mathbb{E}A_j$ for $i \ne j$, using an argument similar to yours we see that $\mathbb{E}A_iA_j = \frac{n(n - 1)}{(n + m)(n + m - 1)}$, so $\text{Cov}(A_i, A_j) = -\frac{nm}{(n + m)^2 (n + m - 1)}$. Note that the covariance is negative, which makes perfect sense: if we know one of the $X$ observations is at position $i$, then we have one fewer option (and thus a smaller probability) for having another $X$ observation at position $j$. Anyway, the second term in the sum above becomes $$ 2 \sum_{i < j} ij \,\text{Cov}(A_i, A_j) = -2\frac{nm}{(n + m)^2 (n + m - 1)} \sum_{i < j} ij. $$
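Since, under $F = G$, every set of ranks occupied by the $X$ sample is equally likely, the covariance formula can be verified exactly by enumeration for small samples (the helper name is mine):

```python
from itertools import combinations

def exact_cov(n, m, i, j):
    # Enumerate all equally likely n-subsets of ranks {1, ..., n+m}
    # occupied by the X sample, and compute Cov(A_i, A_j) directly.
    subsets = list(combinations(range(1, n + m + 1), n))
    p_i = sum(i in s for s in subsets) / len(subsets)
    p_ij = sum(i in s and j in s for s in subsets) / len(subsets)
    return p_ij - p_i * p_i  # E[A_j] = E[A_i] = n/(n+m) by symmetry

n, m, i, j = 3, 4, 2, 5
print(exact_cov(n, m, i, j), -n * m / ((n + m) ** 2 * (n + m - 1)))
```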
Using some symbolic algebra system to compute $$ \sum_{j=1}^{m + n} \sum_{i=1}^{j - 1} ij = \frac{1}{24} (m + n - 1) (m + n) (m + n + 1) (3 m + 3 n + 2) $$ and substituting that into the above, we get $$ 2 \sum_{i < j} ij \text{Cov}(A_i, A_j) = -\frac{1}{12} \frac{nm}{n + m} (n + m + 1) (3m + 3n + 2). $$
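The symbolic identity itself is easy to confirm for any concrete value of $m + n$:

```python
N = 11  # stands for m + n
# Double sum over i < j, exactly as written above.
double_sum = sum(i * j for j in range(1, N + 1) for i in range(1, j))
# Closed form (1/24)(m+n-1)(m+n)(m+n+1)(3m+3n+2); integer-valued, so // is exact.
closed_form = (N - 1) * N * (N + 1) * (3 * N + 2) // 24
print(double_sum, closed_form)  # both 1925
```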
Summing it all up and doing the required arithmetic, we finally get $$ \mathbb{D}R = \frac{1}{12} nm (n + m + 1). $$
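As a final check, $\mathbb{D}R = \frac{1}{12}nm(n+m+1)$ can be confirmed exactly for small $n, m$ by enumerating all $\binom{n+m}{n}$ equally likely rank sets for the $X$ sample (the function name is mine):

```python
from itertools import combinations

def exact_var_R(n, m):
    # Under F = G, each n-subset of ranks {1, ..., n+m} is equally likely,
    # and R is simply the sum of the chosen ranks.
    sums = [sum(s) for s in combinations(range(1, n + m + 1), n)]
    mean = sum(sums) / len(sums)
    return sum((r - mean) ** 2 for r in sums) / len(sums)

n, m = 4, 5
print(exact_var_R(n, m), n * m * (n + m + 1) / 12)  # both ≈ 16.667
```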