A bag holds $N$ red or blue balls. Draw $M$ and $K$ are red. What is the probability that $X$ or more of the $N$ are red?


Suppose I have a bag of $N$ balls and I know each ball is either red or blue.

I draw a random (independent) subset (without replacement) of $M$ balls, and I find that $K$ are red and $(M-K)$ are blue.

What is the probability $P$ that there were $X$ or more red balls in the original bag?

Clearly for $X \le K$, the probability is $1$.

And for $X > N - (M - K)$ the probability is $0$, because at least $M - K$ of the balls are known to be blue.

How do I calculate $P$ when $X$ is within those bounds?

What about when $N$ is much larger than $M$? Is there an approximation?

There are 1 best solutions below

You need to have a prior distribution for the number of red balls in the bag at the start, $R$, in order to answer this question. It could be that the number of red balls is drawn from a uniform distribution: $$\mathbb{P}(R = r) = \frac{1}{N+1} \textrm{ for } r = 0, \dots, N,$$ or each ball could be chosen as red with some probability $p$: $$\mathbb{P}(R = r) = \binom{N}{r} p^r (1 - p)^{N-r} \textrm{ for } r = 0, \dots, N,$$ just to give two examples of prior distributions.
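As a rough illustration, the two example priors can be written as plain Python lists indexed by $r$. The function names `uniform_prior` and `binomial_prior` are my own labels for this sketch, not anything standard.

```python
from math import comb

def uniform_prior(N):
    """P(R = r) = 1 / (N + 1) for r = 0, ..., N, as a list indexed by r."""
    return [1.0 / (N + 1)] * (N + 1)

def binomial_prior(N, p):
    """P(R = r) = C(N, r) p^r (1 - p)^(N - r) for r = 0, ..., N."""
    return [comb(N, r) * p**r * (1 - p)**(N - r) for r in range(N + 1)]
```

Either list sums to 1 (up to floating-point rounding), so it can be used directly as $\mathbb{P}(R = r)$ in what follows.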

Once you have some prior distribution, you can use Bayes' Theorem to answer the question. Let $R^\prime$ be the number of observed red balls. Then: $$\mathbb{P}(R = r \mid R^\prime = K) = \frac{\mathbb{P}(R^\prime = K \mid R = r)\mathbb{P}(R = r)}{\mathbb{P}(R^\prime = K)}.$$

The probability of drawing $K$ red balls given that there are $r$ red balls to start is given by the hypergeometric distribution: $$\mathbb{P}(R^\prime = K \mid R = r) = \frac{\binom{r}{K} \binom{N - r}{M - K}}{\binom{N}{M}} \textrm{ for } \max(0, M - (N - r)) \le K \le \min(r, M),$$ and it is $0$ otherwise.
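A minimal sketch of this hypergeometric likelihood, using only the standard library. The function name `likelihood` and its argument order are my own choices for illustration.

```python
from math import comb

def likelihood(K, r, N, M):
    """P(R' = K | R = r): K red among M balls drawn without replacement
    from a bag of N balls, r of which are red."""
    if K < 0 or K > M or r < 0 or r > N:
        return 0.0
    # math.comb(n, k) returns 0 when k > n, so draws that need more red
    # (or more blue) balls than the bag holds get probability 0.
    return comb(r, K) * comb(N - r, M - K) / comb(N, M)
```

For a fixed $r$, summing `likelihood(K, r, N, M)` over $K = 0, \dots, M$ should give 1, which is a quick sanity check.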

The denominator $\mathbb{P}(R^\prime = K)$ is calculated as follows: $$\mathbb{P}(R^\prime = K) = \sum_{s = K}^N \mathbb{P}(R^\prime = K \mid R = s)\mathbb{P}(R = s).$$ The lower bound for the sum is $s = K$ rather than $s = 0$ because $\mathbb{P}(R^\prime = K \mid R = s) = 0$ whenever $K > s$.

So to calculate $P = \sum_{r = X}^{N} \mathbb{P}(R = r \mid R^\prime = K)$, we need to calculate the following: $$P = \frac{\sum_{r = X}^N \mathbb{P}(R = r) \binom{r}{K} \binom{N - r}{M - K} / \binom{N}{M}}{\sum_{s = K}^N \mathbb{P}(R = s) \binom{s}{K} \binom{N - s}{M - K} / \binom{N}{M}}.$$
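The formula for $P$ can be evaluated numerically along the following lines. This is only a sketch: `posterior_tail` is a name I made up, and the prior is assumed to be a list with `prior[r]` equal to $\mathbb{P}(R = r)$. The common factor $1/\binom{N}{M}$ cancels between numerator and denominator, so it is left out.

```python
from math import comb

def posterior_tail(X, N, M, K, prior):
    """P(R >= X | R' = K), where prior[r] = P(R = r) for r = 0, ..., N."""
    # Unnormalised posterior weights; terms with r < K or r > N - (M - K)
    # are 0 because math.comb(n, k) returns 0 when k > n.
    weights = [prior[r] * comb(r, K) * comb(N - r, M - K) for r in range(N + 1)]
    total = sum(weights)
    if total == 0:
        raise ValueError("the observation has zero probability under this prior")
    return sum(weights[max(X, 0):]) / total

if __name__ == "__main__":
    # Example with a uniform prior on the number of red balls.
    N, M, K = 100, 10, 3
    uniform = [1.0 / (N + 1)] * (N + 1)
    for X in range(0, N + 1, 10):
        print(X, posterior_tail(X, N, M, K, uniform))
```

With any prior that gives the observation positive probability, this returns 1 for $X \le K$ and 0 for $X > N - (M - K)$.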

Checking the lower edge case, we can see that $P = 1$ if $X = K$, which is what we should expect.

As an example, suppose $N = 100$, $M = 10$, $K = 3$, and the uniform distribution is used for the prior. The values of $P$ as a function of $X$ then look like this:

[Figure: $P$ as a function of $X$, uniform prior]

If we use the same $N, M, K$ but use the second prior distribution with $p = 2/3$, then we get the following values of $P$ as a function of $X$:

[Figure: $P$ as a function of $X$, binomial prior with $p = 2/3$]

As $M$ gets larger, the prior distribution becomes less important for estimating $P$. I don't know of an approximation to the formula for $\mathbb{P}(R = r \mid R^\prime = K)$.
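For the uniform prior specifically, one candidate shortcut when $N$ is much larger than $M$ is to treat the red fraction $R/N$ as continuous, in which case its posterior is approximately $\mathrm{Beta}(K + 1, M - K + 1)$. This is an assumption added here as a sketch, not part of the derivation above, and it does not carry over to other priors. For integer parameters the Beta tail equals a binomial sum, so no special functions are needed. The name `approx_tail` is hypothetical.

```python
from math import comb

def approx_tail(x, M, K):
    """Approximate P(R / N >= x | R' = K) under a uniform prior and N >> M.

    Uses P(Beta(K + 1, M - K + 1) >= x) = P(Bin(M + 1, x) <= K),
    which holds because both Beta parameters are integers.
    """
    n = M + 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(K + 1))
```

To approximate $P$ for a threshold $X$, call `approx_tail(X / N, M, K)`. It is worth comparing against the exact sum for a few values before relying on it.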