Probability of Overlap of Sample Subjects from Two Groups 4 SDs Apart


This question came up a little while ago but unfortunately was put on hold. However, I found it intriguing, as I had never come across a question like this before. There are $2$ groups of $30$ people randomly selected from $2$ normally distributed populations with identical standard deviations, but whose means for a particular characteristic are $4$ standard deviations apart. What is the probability of an overlap between the groups, that is, that the person with the highest value in the lower-mean group exceeds the person with the lowest value in the higher-mean group?

My Attempt: The area of overlap for the two distributions is $.0456$. This is twice the area of a single tail more than $2$ SDs from the mean. Additionally, the probability of getting at least one person from a group in the overlap zone is $^{30}C_1\cdot .9544^{29}\cdot .0456 + ^{30}C_2\cdot .9544^{28}\cdot .0456^2 + \cdots + ^{30}C_7\cdot .9544^{23}\cdot .0456^7$. Terms beyond the $7$th are negligible.
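
A minimal R sketch to check these two numbers. The variable names are my own, and I use R only because the simulation in the answer below does. It computes the two-tail area directly (unrounded, it is closer to $.0455$ than $.0456$) and compares the $7$-term sum with the exact "at least one" probability $1 - .9544^{30}$.

```r
# Area beyond 2 SDs in one tail of a standard Normal, doubled for two tails.
p_zone <- 2 * pnorm(-2)
print(round(p_zone, 4))

# Probability that between 1 and 7 of the 30 people fall in the zone,
# using the rounded value .0456 from the text.
p <- 0.0456
p_1_to_7 <- sum(dbinom(1:7, size = 30, prob = p))

# Exact probability of at least one person in the zone, for comparison.
p_at_least_one <- 1 - (1 - p)^30
print(c(p_1_to_7, p_at_least_one))
```

The two printed probabilities should agree to about four decimal places, which supports stopping the series at the $7$th term.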

The probability has to consider the different number combinations of people in the overlap zone as well as the probability of overlap in the zone. Limiting this consideration to $1$ to $7$ people from each group, these combinations are $1,1; 1,2; 2,1; 2,2; 3,1; 1,3; 3,2; 2,3; 3,3; \ldots; 7,7$. For the probability of overlap with $1,1$ in the overlap zone, the two orders AB and BA are equally likely, and BA is an overlap, so this probability is $\frac{1}{2}$. For $1,2$ we have the equally likely outcomes A,B,B; B,A,B and B,B,A, so the probability of overlap is $\frac{2}{3}$. Generally, with $n_1$ from group A and $n_2$ from group B, only one of the $\frac{(n_1+n_2)!}{n_1!\cdot n_2!}$ equally likely orderings has every A below every B, so this part of the probability calculation is $$1 - \frac{n_1!\cdot n_2!}{(n_1+n_2)!}$$ So, the probability of overlap in the overlap zone containing $4$ from group A and $5$ from group B would be:

$$1 - \frac{4!\cdot 5!}{(4+5)!} = 1 - \frac{1}{126} = \frac{125}{126}$$
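
The worked values $\frac{1}{2}$, $\frac{2}{3}$ and $\frac{125}{126}$ all follow from counting orderings: of the `choose(n1 + n2, n1)` equally likely arrangements of A's and B's, exactly one has every A below every B. A small R sketch of that count, where `order_overlap` is just an illustrative name of mine:

```r
# Probability that n1 A's and n2 B's are NOT perfectly separated
# (i.e. not all A's below all B's), assuming every ordering is equally likely.
order_overlap <- function(n1, n2) {
  1 - 1 / choose(n1 + n2, n1)
}

print(order_overlap(1, 1))  # the 1,1 case: 1/2
print(order_overlap(1, 2))  # the 1,2 case: 2/3
print(order_overlap(4, 5))  # the 4,5 case: 125/126
```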

Putting this all together I get: $$\begin{aligned} P(A,B \ \text{overlap} \ge 1) &= \frac{1}{2}(30\cdot .9544^{29}\cdot .0456)^2 \\ &\quad + 2\cdot \frac{2}{3}(30\cdot .9544^{29}\cdot .0456)(^{30}C_2\cdot .9544^{28}\cdot .0456^2) \\ &\quad + \frac{5}{6}(^{30}C_2\cdot .9544^{28}\cdot .0456^2)^2 \\ &\quad + 2\cdot \frac{3}{4}(30\cdot .9544^{29}\cdot .0456)(^{30}C_3\cdot .9544^{27}\cdot .0456^3) \\ &\quad + 2\cdot \frac{9}{10}(^{30}C_2\cdot .9544^{28}\cdot .0456^2)(^{30}C_3\cdot .9544^{27}\cdot .0456^3) \\ &\quad + \frac{19}{20}(^{30}C_3\cdot .9544^{27}\cdot .0456^3)^2 \\ &\quad + \cdots + \frac{3431}{3432}(^{30}C_7\cdot .9544^{23}\cdot .0456^7)^2 \\ &= .4044 \end{aligned}$$
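
For anyone who wants to check the arithmetic of this sum without typing out all $49$ terms, here is a rough R sketch. The variable names are mine; it assumes the zone probability $.0456$, the truncation at $7$ people per group, and the ordering factors $\frac{1}{2}, \frac{2}{3}, \frac{5}{6}, \ldots$ used above. It only reproduces this calculation and says nothing about whether the method itself is sound.

```r
p <- 0.0456   # per-person probability of landing in the zone (from the text)
n <- 30       # people per group
k <- 1:7      # number of people in the zone, truncated at 7 as in the text

# Binomial probability of exactly k of the 30 people being in the zone.
p_k <- dbinom(k, size = n, prob = p)

# Ordering factor 1 - 1/choose(i + j, i) for every pair (i, j):
# 1/2 for (1,1), 2/3 for (1,2), 5/6 for (2,2), and so on.
order_factor <- outer(k, k, function(i, j) 1 - 1 / choose(i + j, i))

# Sum over all pairs of factor * P(i from A in zone) * P(j from B in zone).
total <- sum(order_factor * outer(p_k, p_k))
print(round(total, 4))
```

The printed total should land close to the $.4044$ quoted above.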

Does anyone want to comment on the method or correctness or have a simpler method for doing this calculation?

I did look at a $(1 - p)$ type solution of not being in the overlap zone or only A's or only B's in the zone plus not overlapping when both A's and B's were in the zone but this turned out to be just as long as the method I used.


I think I can see the flaw in my reasoning. An overlap of members between group A and B isn't limited to the region of overlap of distribution curves. That is, an A member can be outside the overlap zone but still be further to the right than a B member in the overlap zone.


There are 2 best solutions below


I tried a Monte Carlo simulation, with the result that the probability of an overlap is about $0.523$.

First, here is my version of the problem statement. We have $X_1, X_2, X_3, \dots , X_{30}$ and $Y_1, Y_2, Y_3, \dots , Y_{30}$, where each $X_i$ is drawn independently from a Normal distribution with mean $-2$ and standard deviation $1$, and each $Y_i$ is drawn independently from a Normal distribution with mean $2$ and standard deviation $1$. The difference of the means is therefore $4$ standard deviations. We would like to know the probability that the maximum of $X_1, X_2, X_3, \dots , X_{30}$ is greater than the minimum of $Y_1, Y_2, Y_3, \dots , Y_{30}$.

One way to estimate the desired probability is to simulate many trials, using a pseudo-random number generator to generate the Normal variables. When I ran $10^6$ trials, the result was that the max $X$ was greater than the min $Y$ in $523,460$ cases, so the estimated probability is about $0.523$. A $95\%$ confidence interval for the probability is $0.5225$ to $0.5244$.
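
As a rough check on that interval, the usual normal-approximation formula gives nearly the same endpoints. Here $\hat p$ and $N$ are just my labels for the observed proportion and the number of trials; R's `prop.test` uses a continuity-corrected score interval instead, so its endpoints differ slightly in the later decimals.

$$\hat p \pm 1.96\sqrt{\frac{\hat p(1-\hat p)}{N}} = 0.52346 \pm 1.96\sqrt{\frac{0.52346 \cdot 0.47654}{10^6}} \approx 0.52346 \pm 0.00098,$$

that is, roughly $0.5225$ to $0.5244$.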

I used R for the simulation. The purpose of the set.seed statement at the start is to make the results reproducible, so anyone running the same code should get exactly the same results.

> set.seed(1234)
> ntrials <- 1e6
> nsucc <- 0
> for (t in 1:ntrials) {
+   xmax <- max(rnorm(30, -2, 1))
+   ymin <- min(rnorm(30, 2, 1))
+   if (xmax > ymin)
+     nsucc <- nsucc + 1
+ }
> nsucc 
[1] 523460
> prop.test(nsucc, ntrials)

        1-sample proportions test with continuity correction

data:  nsucc out of ntrials, null probability 0.5
X-squared = 2201.4, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5224805 0.5244393
sample estimates:
      p 
0.52346 
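
For what it's worth, the same simulation can be written without an explicit loop. This is only a sketch under my own choices (fewer trials to keep the matrices small, and made-up variable names), and because the random draws are consumed in a different order it will not reproduce the count above exactly, only a similar proportion.

```r
set.seed(1234)
ntrials <- 1e5   # fewer trials than the loop version, to limit memory use
n <- 30          # people per group

# One row per trial, one column per sampled person.
x <- matrix(rnorm(ntrials * n, mean = -2, sd = 1), nrow = ntrials)
y <- matrix(rnorm(ntrials * n, mean =  2, sd = 1), nrow = ntrials)

# Largest X and smallest Y within each trial (row).
xmax <- apply(x, 1, max)
ymin <- apply(y, 1, min)

# Fraction of trials in which the largest X exceeds the smallest Y.
p_hat <- mean(xmax > ymin)
print(p_hat)
```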

In the general case, let us assume without loss of generality that for some $\mu > 0$, $$X_i \sim \operatorname{Normal}(-\mu, 1), \quad Y_i \sim \operatorname{Normal}(\mu,1),$$ for $i \in \{1, 2, \ldots, n\}$, where the $X_i$ are IID, the $Y_i$ are IID, and the two samples are independent of each other. We want to calculate $$\Pr[X_{(n)} > Y_{(1)}],$$ where $X_{(n)} = \max_i X_i$ is the maximum order statistic of the sample from $X$, and $Y_{(1)} = \min_i Y_i$ is the minimum order statistic of the sample from $Y$.

It is not difficult to see that $$F_{X_{(n)}}(x) = \Pr[\max_i X_i \le x] \overset{\text{ind}}{=} \prod_{i=1}^n \Pr[X_i \le x] \overset{\text{id}}{=} F_X(x)^n,$$ thus $$f_{X_{(n)}}(x) = n f_X(x) F_X(x)^{n-1},$$ where $f_X$ and $F_X$ are the density and cumulative distribution functions of $X$, respectively. Similarly, $$F_{Y_{(1)}}(y) = 1 - \Pr[\min_i Y_i > y] \overset{\text{ind}}{=} 1 - \prod_{i=1}^n \Pr[Y_i > y] \overset{\text{id}}{=} 1 - (1 - F_Y(y))^n,$$ and $$f_{Y_{(1)}}(y) = nf_Y(y)(1 - F_Y(y))^{n-1}.$$

Then, conditioning on the value of $Y_{(1)}$ and using the independence of the two samples, $$\Pr[X_{(n)} > Y_{(1)}] = \int_{y=-\infty}^\infty \Pr[X_{(n)} > y]f_{Y_{(1)}}(y) \, dy = n \int_{y=-\infty}^\infty (1 - F_X(y)^n)f_Y(y) (1 - F_Y(y))^{n-1} \, dy.$$ Unfortunately this integral does not have a closed form for general $n$ and $\mu$, but it can be numerically integrated for the specific choice $n = 30$, $\mu = 2$, yielding $$0.52346740582758868816989507158239333395206638679228\ldots.$$
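
A minimal sketch of that numerical integration in R, using base R's `integrate`. This is not necessarily how the digits above were obtained; the function name `overlap_integrand` and the tolerance are my own choices, and double-precision quadrature will only recover the leading digits.

```r
# Integrand n * (1 - F_X(y)^n) * f_Y(y) * (1 - F_Y(y))^(n - 1),
# with X ~ Normal(-mu, 1) and Y ~ Normal(mu, 1).
overlap_integrand <- function(y, n, mu) {
  n * (1 - pnorm(y, mean = -mu)^n) *
    dnorm(y, mean = mu) *
    pnorm(y, mean = mu, lower.tail = FALSE)^(n - 1)
}

# Adaptive quadrature over the whole real line for n = 30, mu = 2.
res <- integrate(overlap_integrand, lower = -Inf, upper = Inf,
                 n = 30, mu = 2, rel.tol = 1e-8)
print(res$value)
```

Using `lower.tail = FALSE` for $1 - F_Y(y)$ avoids subtracting two nearly equal numbers when $y$ is large.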

A plot of the probability of "overlap" for $n = 1, 2, \ldots, 100$ is provided below.

[Figure: probability of overlap as a function of $n$, for $n = 1, 2, \ldots, 100$.]
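
A curve of this kind can be regenerated with a short R sketch along the following lines. It assumes $\mu = 2$ as in the original question, and `overlap_prob` is a hypothetical helper name of mine that evaluates the integral $n \int (1 - F_X(y)^n) f_Y(y) (1 - F_Y(y))^{n-1} \, dy$ by quadrature.

```r
# Probability that max(X) > min(Y) for samples of size n, with
# X ~ Normal(-mu, 1) and Y ~ Normal(mu, 1), by numerical integration.
overlap_prob <- function(n, mu = 2) {
  integrand <- function(y) {
    n * (1 - pnorm(y, mean = -mu)^n) *
      dnorm(y, mean = mu) *
      pnorm(y, mean = mu, lower.tail = FALSE)^(n - 1)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

# Evaluate the probability for each sample size and draw the curve.
n_values <- 1:100
probs <- sapply(n_values, overlap_prob)

plot(n_values, probs, type = "l",
     xlab = "n (sample size per group)", ylab = "Probability of overlap")
```

One easy sanity check is $n = 1$, where the answer is simply $\Pr[X_1 > Y_1] = \Phi(-2\mu/\sqrt{2})$, since $X_1 - Y_1$ is Normal with mean $-2\mu$ and variance $2$.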

Note that the shape of the curve would be different for various choices of $\mu$; e.g., as $\mu$ increases, the curve will flatten out. A plot of the probability as a function of both $n$ and $\mu \in [0,5]$ looks like this:

[Figure: probability of overlap as a function of both $n$ and $\mu \in [0,5]$.]
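
To get a feel for the dependence on $\mu$, the sketch below evaluates the same integral at a few values of $\mu$ for $n = 30$. The helper name `overlap_prob` and the chosen grid are assumptions of mine, not taken from the plot. It also checks the $\mu = 0$ edge case, where all $2n$ values are exchangeable, so the chance that every $X$ lies below every $Y$ is $1/\binom{2n}{n}$.

```r
# Probability that max(X) > min(Y) for samples of size n, with
# X ~ Normal(-mu, 1) and Y ~ Normal(mu, 1), by numerical integration.
overlap_prob <- function(n, mu) {
  integrand <- function(y) {
    n * (1 - pnorm(y, mean = -mu)^n) *
      dnorm(y, mean = mu) *
      pnorm(y, mean = mu, lower.tail = FALSE)^(n - 1)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

# n = 30 at a few separations; mu = 2 corresponds to means 4 SDs apart.
mu_values <- 0:5
probs_mu <- sapply(mu_values, function(mu) overlap_prob(30, mu))
print(signif(probs_mu, 4))

# Edge case mu = 0 for a small n: compare with 1 - 1/choose(2n, n).
n_small <- 3
check <- c(numeric = overlap_prob(n_small, 0),
           exact = 1 - 1 / choose(2 * n_small, n_small))
print(check)
```

The probability falls quickly as $\mu$ grows, because both $\max_i X_i$ and $\min_i Y_i$ move away from the midpoint between the two means.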