Number of unique items in a big set from a small sample


We have a box with $m=1\,000\,000$ cards. Each card contains one word. Words are repeated, so the number of unique words, $n$, is relatively small. $n$ is unknown.

We draw a sample of $k=5000$ cards and find that there are $42$ unique words in it.

With this information, we know $P(n\geq42) = 1$.

How can we compute $P(n\geq43)$, $P(n\geq44)$, and so on?

Is this problem common, and does it have a "common name"?

PS: we have the frequency of each of our $42$ words in the $5000$-card sample, and these can be used in the solution if relevant. Let's call these frequencies $f_1, f_2, \dots,f_{42}$.
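To make the role of the frequencies concrete, here is a minimal sketch of one standard estimator that uses them: the bias-corrected Chao1 estimate, from the literature on what is often called the "unseen species" (or species richness) problem. It gives a rough lower-bound point estimate of $n$, not the probabilities $P(n\geq43)$ themselves, and it assumes the sample is small relative to the box. The function name and the example counts are hypothetical and only for illustration.

```python
from collections import Counter


def chao1_lower_bound(word_counts):
    """Bias-corrected Chao1 estimate of the total number of unique words.

    word_counts: per-word counts in the sample (the f_1, ..., f_42 above).
    """
    # Number of distinct words actually observed in the sample.
    observed = sum(1 for c in word_counts if c > 0)

    # How many words were seen exactly once and exactly twice.
    count_of_counts = Counter(word_counts)
    singletons = count_of_counts[1]
    doubletons = count_of_counts[2]

    # Chao1 (bias-corrected form): many rare words in the sample
    # suggest that more words remain unseen in the box.
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical frequencies (NOT from the question): 42 words summing to 5000,
# with 3 words seen once and 2 words seen twice.
example_counts = [1] * 3 + [2] * 2 + [135] * 36 + [133]

estimate = chao1_lower_bound(example_counts)
print(estimate)
```

If no word in the sample appears exactly once, this estimate reduces to the observed count, which matches the intuition that a sample without rare words gives little evidence of unseen ones.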