Probability of sampling at least one member of each class

135 Views Asked by At

Suppose we have $N$ samples partitioned into $C$ classes with $N_1, ..., N_C$ the number of samples in each class. Randomly we pick up $n$ samples. What is the probability to have every class in this selection (at least one sample from each class)

For this problem, I came up with the solution presented here (https://math.stackexchange.com/users/72858/ethan-bolker)

The total number of samples of size $n$ picked up among $N$ is $\binom{N}{n}$ Then you have $N_1$ ways to choose one sample from the first class, $N_2$ from the second, ... $N_C$ from the last one. And finally $\binom{N-C}{n-c}$ ways to pick up indifferently the remaining samples.

$$\frac{N_1 N_2...N_C\binom{N-C}{n-c}}{\binom{N}{n}}$$

However with $N=10 000$, $N_1=N_2=...=N_C=1000$ and $n=100$, I get very high probabilities around $10^9$

So I thought I did a mistake and went with the complementary event "What is the probability of having at least one class that is NOT represented ?" Here we will take balanced classes $N_1=N_2=...=N_C$

There are $C$ ways to choose the class that is not represented then you have $\binom{N-N_1}{n}$ ways to pick up the $n$ samples among the $C-1$ classes that are represented (population of size $N-N_1$)

So the answer of the first problem would the be $$1- \frac{C\binom{N-N_1}{n}}{\binom{N}{n}}$$

However, once again with $N=10 000$, $N_1=N_2=...=N_C=1000$ and $n=10$ I will obtain negative or above 1 probabilities.

Could someone please tell me why these formulas are false

1

There are 1 best solutions below

2
On BEST ANSWER

Consider for example $6$ samples, $2$ of which are blue (B1 and B2), $2$ of which are green (G1 and G2), and $2$ red (R1 and R2). Then we want to pick at least one sample from each color with $n=4$

With your initial formula, the choice $\{\text{B1,B2,G1,R1}\}$ gets counted twice! Indeed, we could first pick $\{\text{B1,G1,R1}\}$, one from each color, and then pick $\{\text{B2}\}$ from the remaining samples, or we could first pick $\{\text{B2,G1,R1}\}$, and then pick $\{\text{B1}\}$ from the remaining samples.

The bigger $n$ gets, the more incorrect double counting you do, so that you incorrectly get probabilities that exceed $1$.


Your second approach is also wrong due to double counting: Let's add two samples of color yellow (Y1 and Y2) to the above example. One way to not pick at least one sample from each color is the choice $\{B1, B2, G1, G2\}$. But you count this twice! Indeed, on could first pick to ignore the red color and then choose $\{B1, B2, G1, G2\}$, or one could first pick to ignore the yellow color and then choose $\{B1, B2, G1, G2\}$. Because you subtract incorrectly doubly-counted cases, you can incorrectly end up with negative probabilities.