Given a sample of size N, find expected number of previously unseen colours after another K draws

142 Views Asked by At

There is an urn with $B$ balls of $C$ distinct colours ($B >> C$). Both $B$ and $C$ are finite but both are unknown.

I draw sample $N$ balls without replacement from the urn and that's my only source of stats about the general population (number of distinct colours in the sample, balls per colour, etc).

Let's say I draw another sample of size $K$ from the urn, again without replacement.

What is the expected number of new colours found (i.e. number of distinct colours in sample $K$ that were not found in sample $N$)?

I tried to solve the problem using Good-Turing frequency estimation, but i got stuck.

Note: As I'm using the small frequencies here only, I omitted the smoothing part of the estimation.

Here's my attempt (using the symbols from Wiki's link):

At first draw (of total K), let's call it $X_1$:

  • Probability of drawing a new colour: $$p_{0\_1} = \frac{N_1}{N}$$
  • Probability of drawing a colour seen once so far: $$p_{1\_1} = \frac{2N_2}N$$
  • Expected value of new colours drawn: $$E(X_1) = 1 * p_{0\_1} + 0 * (other\_frequencies\_drawn)$$

At second draw, $X_2$:

  • Probability of drawing a new colour: $$p_{0\_2} = p_{0\_1}\frac{N_1+1}{N+1} + p_{1\_1}\frac{N_1-1}{N+1} + (1 - p_{0\_1} - p_{1\_1})\frac{N_1}{N+1}$$ Explaining the above equation, it has 3 parts since it seems only 3 events concern us:

    • A new colour was found in previous draw (both nominator and denominator increase)
    • No new colour was found in previous draw, but we drew a colour that had frequency=1, so it's now in frequency=2 group (nominator decreased, denominator increases)
    • No new colour was found in previous draw and the colour drawn is from frequency>1 (nominator doesn't change, denominator increases)
  • Expected value of drawing new colour: $$E(X_2) = 1 * p_{0\_2} + 0 * (other\_frequencies\_drawn)$$

Now, the next step will get big (9 parts) and I'm not sure if I should have found the pattern by now, but I can't really see it. Assuming the general approach is correct, I believe what I should do now is find a product pattern like $$E(X_n) = \prod_{i=0}^n something$$ for expected number of colours in every draw and then sum them up like $$Total = \sum_{i=0}^K E(i)$$.

Is the approach reasonable? Is there any easy way to find the product pattern without expanding it further?

1

There are 1 best solutions below

0
On BEST ANSWER

I found an answer to my own question, the exact thing I was looking for is the Good-Toulmin estimator.

Wiki: https://en.wikipedia.org/wiki/Unseen_species_problem

Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.825.3806&rep=rep1&type=pdf