From a box of balls with 20 distinct colors, after drawing n balls with replacement, what is the probability of observing exactly x different colors?

84 Views Asked by At

So assuming there is equal probability of observing any distinct color in a draw, I drew out a pyramid where every draw creates two new edges from its nodes. This then becomes a recursive problem with $2^n$ complexity.

The $n$ values I am dealing with are in the range of a few thousand, so it's very computationally demanding, but not completely unfeasible.

However, I need to solve a more complicated version where each of the $20$ colors have their own probability, then it becomes a $20^n$ problem which I have no method for solving.

I am happy with an approximate answer if there is an algorithm that can give me one for a problem like this.

Thanks,

Michael

1

There are 1 best solutions below

2
On

Equal probabilities

If each of the 20 colors is equally likely to be drawn, then the probability of observing $x$ different colors is $$ {n \brace x} \frac{20!}{(20-x)!}\cdot 20^{-n}. $$ Here, ${n \brace x}$ is a Stirling number of the second kind, which can be computed as follows: $$ {n \brace x}= \frac1{x!}\sum_{j=0}^x(-1)^j\binom{x}j (x-j)^n $$

Unequal probabilities

Here is an approximation which might be helpful. However, I cannot prove this approximation works with any quality, so please use this with caution.

You have $20$ colors with unequal probabilities. Let us say that the probability of each color is specified by a list $p_1,\dots,p_{20}$, where $p_i\ge 0$ for each $i$ and $p_1+\dots+p_{20}=1$.

Let $C_1,\dots,C_{20}$ be the random variables representing the number of balls of each color which were drawn. The reason this problem is so difficult is that the variables $C_1,\dots,C_{20}$ are not independent. We can get a nice answer by approximating them by independent Poisson random variables. Specifically, for each $i\in \{1,\dots,20\}$, let $$ Z_i \sim \text{Poisson}(np_i),\qquad \text{so}\quad P(Z_i=k)=e^{-np_i}\frac{(np_i)^k}{k!}\quad (k\in \{0,1,2,\dots\}) $$ such that the list $Z_1,\dots,Z_{20}$ is independent. The two distributions are related to each other by the following:

Theorem: Let $(c_1,\dots,c_{20})$ be a list of nonnegative integers summing to $n$. The probability that $(C_1,\dots,C_{20})$ assumes the value $(c_1,\dots,c_{20})$ is equal to the conditional probability that $(Z_1,\dots,Z_{20})$ equals $(c_1,\dots,c_{20})$ given that $Z_1+\dots+Z_{20}=n$.

Since $\mathbb E[Z_1+\dots+Z_{20}]=n$, and since $Z_1+\dots+Z_{20}$ is concentrated somewhat tightly around its mean, this theorem suggests that the vector $(C_1,\dots,C_{20})$ is well-approximated by the vector $(Z_1,\dots,Z_{20})$.

You want to find the probability that the list $(C_1,\dots,C_{20})$ has exactly $x$ nonzero entries. I will approximate this by the probability that $(Z_1,\dots,Z_{20})$ has exactly $x$ nonzero entries. To this end, for each $i\in \{1,\dots,20\}$, let $$ a_i=P(Z_i>0)=1-e^{-np_i},\\ b_i=P(Z_i=0)=e^{-np_i},\,\,\,\,\;\;\;\; $$ Then $$ \begin{align} P(\vec Z\text{ has $x$ nonzero entries}) &= b_1\cdot b_2 \cdots b_{20}\sum_{1\le i_1< i_2\le \dots <i_{x}\le 20} \prod_{k=1}^x (a_{i_k}/b_{i_k}) \\\\ &= b_1\cdot b_2 \cdots b_{20}\cdot e_{x}\left(\frac{a_1}{b_1},\frac{a_2}{b_2},\dots,\frac{a_{20}}{b_{20}}\right)\tag1 \\ &= a_1\cdot a_2 \cdots a_{20}\cdot e_{20-x}\left(\frac{b_1}{a_1},\frac{b_2}{a_2},\dots,\frac{b_{20}}{a_{20}}\right)\tag2 \end{align} $$ Here, $e_x(y_1,\dots,y_{20})$ is the $x^\text{th}$ elementary symmetric polynomial in $20$ variables. I showed how to compute the values of these polynomials in this answer, with some more detail in this other answer.

Both equations $(1)$ and $(2)$ are equal, but one of them might be more numerically stable to compute, given the number of balls and the probabilities. Basically, you should choose whichever one involves the least division by small numbers, because that leads to numerical error.