Given a standard deck of 52 playing cards, if you draw from the deck 10 times with replacement, what is the probability that you have drawn at least 3 of the 4 aces, each at least once? And how does this generalize to $N$ playing cards, $D$ draws, $I$ cards that we are interested in, and $J$ cards of interest that we must each draw at least once?
I came across a simpler version of this problem while working on a hobby project, trying to figure out some statistics for a Pokémon-themed Discord bot gacha game, but I became intrigued by this variant after finding a few unanswered Math StackExchange posts about it. (1, 2) Unfortunately those posts didn't get any attention, so I figured I would make a post with the additional information I've found.
Formalization
I was able to formalize the problem as $f(n_1, n_2, k_1, k_2, p_1)$, where:
- $n_1$ is the number of cards drawn
- $n_2$ is the number of cards that we are interested in
- $k_1$ is the minimum number of times we want to draw each card we are interested in
- $k_2$ is the minimum number of cards of interest that we want to draw at least $k_1$ times each
- $p_1$ is the probability of drawing a specific card
My attempts so far
Binomial distribution of binomial distributions
I made some initial attempts to solve it by modeling a binomial distribution whose events were obtaining at least one copy of a card of interest in $n_1$ draws (each event itself a binomial tail probability). But when I validated that against some simulation code I wrote, the formulation came up with different results, which I think is because the events are not independent (drawing one particular card in a set of draws affects the probability of drawing at least one of another particular card in the same set of draws).
I think that approach should give an upper bound on the answer for the problem. It models drawing at least $k_1$ copies of a single card of interest as a binomial tail probability, and then (incorrectly) treats those per-card events as independent, plugging the result into a second binomial distribution (not true, since those events are not independent).
But I think the dependence is always destructive in this case (drawing a particular card on a given draw prevents you from drawing any other card on that same draw), so the actual probability should be upper bounded by this formula. But I'm not 100% sure of that.
$$ p_2 = \sum_{k=k_1}^{n_1} \binom{n_1}{k} p_1^{k} (1 - p_1)^{n_1 - k} $$ $$ \sum_{k=k_2}^{n_2} \binom{n_2}{k} p_2^{k} (1 - p_2)^{n_2 - k} $$
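This upper-bound formula is a couple of lines of code. A minimal sketch (the helper name `binom_tail` is mine), using the same parameters as the original question, $n_1 = 10$, $n_2 = 4$, $k_1 = 1$, $k_2 = 3$, $p_1 = 1/52$:

```python
from math import comb

def binom_tail(n, k, p):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p1 = 1 / 52
p2 = binom_tail(10, 1, p1)    # chance a specific ace shows up at least once
upper = binom_tail(4, 3, p2)  # (incorrectly) treats the four ace events as independent
print(p2, upper)
```

This prints $p_2 \approx 0.1765$ and an upper bound of $\approx 0.0191$, noticeably above the simulated $\approx 0.015$, consistent with the dependence being destructive.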
Multinomial distribution
After searching online for similar problems or approaches that might work, I learned about multinomial distributions. I think you can model the card drawing with replacement as a multinomial distribution and then sum the probability mass function over all the possible winning sets of draws, where there is probably a trick to group those sets into classes of equal probability.
This is my current best guess as to a framework for a solution that might work, but I haven't worked out the details on it and tested it against the simulation yet.
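For small instances, that multinomial idea can be brute-forced without any grouping trick: enumerate every count vector for the $n_2$ interesting cards (lumping all other cards into one bucket), keep the vectors where at least $k_2$ counts reach $k_1$, and sum the multinomial pmf. A sketch under that interpretation (the function name is mine, not from my gist):

```python
from itertools import product
from math import factorial
from fractions import Fraction

def multinomial_success(n1, n2, k1, k2, p):
    # Sum the multinomial pmf over all winning count vectors.
    # counts[i] = copies drawn of interesting card i; everything else
    # goes into a single "other" bucket with probability q = 1 - n2*p.
    q = 1 - n2 * p
    total = Fraction(0)
    for counts in product(range(n1 + 1), repeat=n2):
        rest = n1 - sum(counts)
        if rest < 0:
            continue  # more interesting cards than draws: impossible
        if sum(c >= k1 for c in counts) < k2:
            continue  # not a winning draw set
        coeff = factorial(n1)
        for c in counts:
            coeff //= factorial(c)
        coeff //= factorial(rest)
        total += coeff * p**sum(counts) * q**rest
    return total

print(float(multinomial_success(10, 4, 1, 3, Fraction(1, 52))))
```

The enumeration is $(n_1+1)^{n_2}$ vectors, so this only scales to small $n_1, n_2$, but it gives exact rationals to validate other approaches against.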
Simpler variant
Before I came across this variant of the problem, in my hobby project I came across the simpler case where $k_2 = n_2$ and $k_1 = 1$, the "collect all of the Pokémon at least once" problem. Unfortunately I haven't solved this question yet either.
Starting with solving this simpler variant might be useful in finding a solution for the more general case. But if we do find a solution for the general case, it should work for this specific case as well.
Simulation code
I posted the simulation code I wrote on a GitHub gist at the link below. You can run it by providing $n_1$, $n_2$, $k_1$, $k_2$, and the number of total cards in the deck. You can also optionally provide an RNG seed and a number of iterations (number of draws of $n_1$ cards to simulate).
```
$ python sim.py simulate 10 4 1 3 52 --seed 42
Namespace(command='simulate', iterations=100000, k_1=1, k_2=3, n_1=10, n_2=4, seed=42, total=52)
100%|████████████████████████████████| 100000/100000 [00:01<00:00, 95673.52it/s]
estimated_probability = 0.01499
```
The script also includes a mode to run the "binomial distribution of binomial distributions" approach that I think upper bounds the result.
```
$ python mathoverflow_answer.py calculate 10 4 1 3 52
p_1 = 0.019230769230769232
p_exactly = 0.16147234355824405
p_2 = 0.17649104785295508
result = 0.019079344697167273
```
Internet research
While working on this problem I did some searching online to try to find similar problems or potential approaches for solutions. Here are some links I found that seemed potentially relevant.
- Math StackExchange posts with the same question that went unanswered. (1, 2)
- Resources related to cumulative distribution functions of multinomial distributions.
Conclusion
Above are my thoughts on and current attempts at solving this problem. My current best lead is to try something with multinomial distributions, but I haven't been able to solidify and test that approach yet.
Does anyone have thoughts on how to go about solving this problem? Does this approach seem reasonable? Are there better approaches that yield a solution to this problem?
This can be expressed as a recursive formula.
First, let's look only at $k_1=1$. If $n_1>0$ and $n_2 \geq k_2>0$, then
$$ f(n_1,n_2,1,k_2,p) = n_2 p \cdot f(n_1-1, n_2-1, 1, k_2-1,p) + (1-n_2 p) f(n_1-1, n_2, 1, k_2, p) $$
This is because if the first draw is a wanted card ($n_2 p$), then the conditional probability is the same as for an instance with one less draw ($n_1-1$), one less card type still "interesting" ($n_2-1$), and one less success needed to reach the target ($k_2-1$). Or if the first draw is not a wanted card ($1-n_2 p$), then the conditional probability is the same as for an instance with one less draw ($n_1-1$) but everything else staying the same.
We have initial conditions,
$$ \begin{align*} f(n_1, n_2, 1, 0, p) &= 1 \\ f(0, n_2, 1, k_2, p) &= 0 & \mathrm{if}~k_2>0 \end{align*}$$
since wanting to collect no more cards ($k_2=0$) means the player has already succeeded, and otherwise having no draws left ($n_1=0$) means they've failed.
From the recursion relation, we can see that each set of $f$ values with $n_2-k_2$ constant is an independent problem. We can also consider $p$ constant in the recursion problem. So let's call $m = n_2-k_2$, the maximum number of interesting card types we'd be okay with missing out on. Then there are just two variables to recurse on to get $f(n_1, n_2, 1, n_2-m, p)$.
A computer program can evaluate this kind of recursion fairly efficiently and with whatever accuracy is needed, including exact rational ratios. For $f(10, 4, 1, 3, 1/52)$, this would start like
$$ \begin{align*} f(1, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 \\ f(2, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 + \frac{50}{52} \cdot \frac{2}{52} = \frac{204}{52^2} \\ f(2, 3, 1, 2, 1/52) &= \frac{3}{52} \cdot \frac{2}{52} = \frac{6}{52^2} \\ f(3, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 + \frac{50}{52} \cdot \frac{204}{52^2} \\ f(3, 3, 1, 2, 1/52) &= \frac{3}{52} \cdot \frac{204}{52^2} + \frac{49}{52} \cdot \frac{6}{52^2} \\ f(3, 4, 1, 3, 1/52) &= \frac{4}{52} \cdot \frac{6}{52^2} \end{align*} $$
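The $k_1 = 1$ recursion above is a few lines of memoized code; a sketch with exact rationals (the function name is mine), which reproduces the worked values above:

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def f1(n1, n2, k2, p):
    # f(n1, n2, 1, k2, p): probability of completing at least k2 of the
    # n2 interesting card types (each with draw probability p) in n1 draws
    if k2 <= 0:   # nothing left to collect: already a success
        return Fraction(1)
    if n1 == 0:   # draws exhausted with k2 > 0: failure
        return Fraction(0)
    hit = n2 * p  # chance this draw is a still-missing interesting card
    return hit * f1(n1 - 1, n2 - 1, k2 - 1, p) + (1 - hit) * f1(n1 - 1, n2, k2, p)

exact = f1(10, 4, 3, Fraction(1, 52))
print(exact, float(exact))
```

Passing `p` as a `Fraction` keeps every intermediate value an exact rational, and `lru_cache` collapses the recursion to the $O(n_1 \cdot k_2)$ distinct states with $n_2 - k_2$ fixed.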
For $k_1 > 1$, things get trickier. A recursion will need more state than the existing five-variable $f$, since it can matter whether the player is looking for two more identical cards or one more of each of two card types. Instead, we can introduce a sequence of counter variables $w_1, \ldots, w_{n_2}$, representing for each interesting card type the number of copies the player still needs to reach a "success" for that type: $g(n_1, k_2, p, w_1, \ldots, w_{n_2})$ is the probability that the player will get at least $k_2$ successes within the next $n_1$ draws, where $p$ is the probability of drawing any particular card and $w_i$ more draws of card $i$ count as a success.
This relates to the original $f$ by taking all $w_i = k_1$. That is,
$$ f(n_1, n_2, k_1, k_2, p) = g(n_1, k_2, p, \underbrace{k_1, k_1, \ldots, k_1}_{n_2 \mathrm{~times}}) $$
Note that the value of $g$ does not change when the values of $w_i$ are permuted. This should drastically cut back on the number of calculations needed when applying the recursion, whether by hand or by computer. It would probably be helpful to keep the $w_i$ in non-descending or non-ascending order.
For initial conditions, again $k_2=0$ is a success and $k_2>0, n_1=0$ is a failure. The recursion is
$$ g(n_1, k_2, p, w_1, \ldots, w_{n_2}) = p \sum_{i \,:\, w_i = 1} g(n_1-1, k_2-1, p, w_1, \ldots, w_{i-1}, 0, w_{i+1}, \ldots, w_{n_2}) \\ {}+ p \sum_{i \,:\, w_i > 1} g(n_1-1, k_2, p, w_1, \ldots, w_{i-1}, w_i - 1, w_{i+1}, \ldots, w_{n_2}) \\ {}+ \left(1 - p \, \#\{i : w_i \geq 1\}\right) g(n_1-1, k_2, p, w_1, \ldots, w_{n_2}) $$
where the last coefficient is $p$ times the number of card types still needing at least one more copy, subtracted from $1$, since drawing an already-completed card type makes no progress.
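A memoized sketch of this general recursion (function names mine), storing the $w_i$ as a sorted tuple to exploit the permutation symmetry, with the no-progress branch weighted by $1$ minus $p$ times the count of still-active card types:

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def g(n1, k2, p, w):
    # w: sorted tuple of remaining needed copies per interesting card type
    if k2 <= 0:
        return Fraction(1)  # enough card types completed: success
    if n1 == 0:
        return Fraction(0)  # out of draws with k2 > 0: failure
    total = Fraction(0)
    for i, wi in enumerate(w):
        if wi == 0:
            continue        # this card type is already complete
        nw = list(w)
        nw[i] = wi - 1
        nk2 = k2 - 1 if wi == 1 else k2  # last needed copy scores a success
        total += p * g(n1 - 1, nk2, p, tuple(sorted(nw)))
    active = sum(1 for wi in w if wi >= 1)
    total += (1 - p * active) * g(n1 - 1, k2, p, w)  # draw made no progress
    return total

def f(n1, n2, k1, k2, p):
    # original five-variable f, with every counter initialized to k1
    return g(n1, k2, p, (k1,) * n2)

print(float(f(10, 4, 1, 3, Fraction(1, 52))))
```

Sorting the tuple before each recursive call means permutations of the $w_i$ hit the same cache entry; for $k_1 = 1$ this reproduces the simpler recursion's values, and for the original question it lands close to the simulation's estimate of $0.01499$.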