I am working on a data science project which is slowly turning into a mathematical problem : To keep things short, I have a population of size $A$. This population contains $C$ classes and we know the distribution of those classes inside the population (i.e. for each class we know the probability to get one member of this class, if we sample one element from $A$ : $p_1, ... p_C$).
Now, at the start of an algorithm I made, I sample $n$ elements from $A$ and I have observed that the outcome of the algorithm depends greatly on this initial sampling : If I have at least one element of each class in my initial sampling, my algorithm will do great. If one or more classes are not represented in my initial sampling, my algorithm will perform poorly.
Given that I am free to choose whichever $n$ I see fit, I would like to know which one to choose to be "sure" to choose at least one element from each class given $A$, $C$, and $p_1, ... p_C$.
Because I will have to work on several dataset, I would like to know if there is a way to find $n$, given any population size $A$, any number of classes $C$ and any distribution $p_1, ... p_C$. Is there a way to do so ?
It looks like some kind of reverse birthday problem. I have started looking at ways to solve this problem using combinatorics but I don't see any easy way to find a general solution.
Thanks for your help !
Suppose the total population is $N$, partitioned into $C$ classes with sizes $N_i$ (so $N = N_1 + \cdots + N_C$). We choose $n$ points at random. We want the probability that we have at least one from each class.
The total number of samples of size $n$ is $\binom{N}{n}$.
We can choose one from each class $N_1 N_2 \cdots N_C$ ways, and then choose the remaining points from the remaining population in $\binom{N-C}{n-C}$ ways. So the probability that we have at least one from each class is $$ \frac{N_1 N_2 \cdots N_C \binom{N-C}{n-C}}{\binom{N}{n}} . $$
Note: Fixed in response to OP's correction.