There is an urn with $B$ balls of $C$ distinct colours ($B >> C$). Both $B$ and $C$ are finite but both are unknown.
I draw sample $N$ balls without replacement from the urn and that's my only source of stats about the general population (number of distinct colours in the sample, balls per colour, etc).
Let's say I draw another sample of size $K$ from the urn, again without replacement.
What is the expected number of new colours found (i.e. number of distinct colours in sample $K$ that were not found in sample $N$)?
I tried to solve the problem using Good-Turing frequency estimation, but i got stuck.
Note: As I'm using the small frequencies here only, I omitted the smoothing part of the estimation.
Here's my attempt (using the symbols from Wiki's link):
At first draw (of total K), let's call it $X_1$:
- Probability of drawing a new colour: $$p_{0\_1} = \frac{N_1}{N}$$
- Probability of drawing a colour seen once so far: $$p_{1\_1} = \frac{2N_2}N$$
- Expected value of new colours drawn: $$E(X_1) = 1 * p_{0\_1} + 0 * (other\_frequencies\_drawn)$$
At second draw, $X_2$:
Probability of drawing a new colour: $$p_{0\_2} = p_{0\_1}\frac{N_1+1}{N+1} + p_{1\_1}\frac{N_1-1}{N+1} + (1 - p_{0\_1} - p_{1\_1})\frac{N_1}{N+1}$$ Explaining the above equation, it has 3 parts since it seems only 3 events concern us:
- A new colour was found in previous draw (both nominator and denominator increase)
- No new colour was found in previous draw, but we drew a colour that had frequency=1, so it's now in frequency=2 group (nominator decreased, denominator increases)
- No new colour was found in previous draw and the colour drawn is from frequency>1 (nominator doesn't change, denominator increases)
Expected value of drawing new colour: $$E(X_2) = 1 * p_{0\_2} + 0 * (other\_frequencies\_drawn)$$
Now, the next step will get big (9 parts) and I'm not sure if I should have found the pattern by now, but I can't really see it. Assuming the general approach is correct, I believe what I should do now is find a product pattern like $$E(X_n) = \prod_{i=0}^n something$$ for expected number of colours in every draw and then sum them up like $$Total = \sum_{i=0}^K E(i)$$.
Is the approach reasonable? Is there any easy way to find the product pattern without expanding it further?
I found an answer to my own question, the exact thing I was looking for is the Good-Toulmin estimator.
Wiki: https://en.wikipedia.org/wiki/Unseen_species_problem
Paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.825.3806&rep=rep1&type=pdf