Probability of drawing at least J cards out of I cards of interest from a deck of N cards within D draws with replacement

283 Views Asked by At

Given a regular deck of 52 playing cards, if you draw from the deck 10 times with replacement, what is the probability that you have drawn at least 3 different aces (any of the 4 aces) at least once each? And how does this generalize to N playing cards, D draws, I many cards that we are interested in, and J many of the cards we are interested in that we must draw at least once.

I came across a simpler version of this problem while working on a hobby project trying to figure out some statistics related to a Pokémon-themed Discord bot gacha game, but I was intrigued with this problem after I found a few unanswered Math StackExchange posts about this variant. (1, 2) Unfortunately those posts didn't get any attention so I figured I would make a post with the additional information that I found.

Formalization

I was able to formalize the problem as $f(n_1, n_2, k_1, k_2, p_1)$, where:

  • $n_1$ is the number of cards drawn
  • $n_2$ is the number of cards that we are interested in
  • $k_1$ is the number of times at least that we want to draw a card we are interested in
  • $k_2$ is the number of cards at least that we are interested in that we want to draw at least $k_1$ times
  • $p_1$ is the probability of drawing a specific card

My attempts so far

Binomial distribution of binomial distributions

I had some initial attempts to solve it by modeling a binomial distribution where the events were obtaining at least 1 of a card of interest in $n_1$ draws (that also being a binomial distribution). But I validated that against some simulation code that I wrote, and the formulation came up with different results, which I think were due to the events being not independent (drawing one particular card in a set of draws effects the probability of drawing at least one of another particular card on the same set of draws).

I think that approach should give an upper bound on the answer for the problem. It models drawing at least $k_1$ copies of a card we are interested in by using a binomial distribution. It then (incorrectly) assumes that we can model that for each card we are interested in drawing as a binomial distribution (not true, since those probabilities are not independent).

But I think the lack of independence is always destructive in this case (drawing a particular card prevents you from have drawing any other card in that same draw action), so the actual probability should be upper bounded by this formula. But I'm not 100% sure on that.

$$ p_2 = \sum_{k=k_1}^{n_1} \binom{n_1}{k} * p_1^{k} * (1 - p_1)^{n_1 - k} $$ $$ \sum_{k=k_2}^{n_2} \binom{n_2}{k} * p_2^{k} * (1 - p_2)^{n_2 - k} $$

Multinomial distribution

After looking around on the internet for similar problems or approaches that might work I learned about multinomial distributions. I think you can try modelling the card drawing with replacement as a multinomial distribution and then sum the results of the probability distribution function with the different input sets that make up all the possible winning sets of draws, where there is probably a trick to group those sets into categories of equal probability.

This is my current best guess as to a framework for a solution that might work, but I haven't worked out the details on it and tested it against the simulation yet.

Simpler variant

Before I came across this variant of the problem, in my hobby project I came across the simpler case where $k_2 = n_2$ and $k_1 = 1$, the "collect all of the Pokémon at least once" problem. Unfortunately I haven't solved this question yet either.

Starting with solving this simpler variant might be useful in finding a solution for the more general case. But if we do find a solution for the general case, it should work for this specific case as well.

Simulation code

I posted the simulation code I wrote on a GitHub gist at the link below. You can run it by providing $n_1$, $n_2$, $k_1$, $k_2$, and the number of total cards in the deck. You can also optionally provide an RNG seed and a number of iterations (number of draws of $n_1$ cards to simulate).

$ python sim.py simulate 10 4 1 3 52 --seed 42
Namespace(command='simulate', iterations=100000, k_1=1, k_2=3, n_1=10, n_2=4, seed=42, total=52)
100%|████████████████████████████████| 100000/100000 [00:01<00:00, 95673.52it/s]
estimated_probability = 0.01499

The script also includes a mode to run the "binomial distribution of binomial distributions" approach that I think upper bounds the result.

$ python mathoverflow_answer.py calculate 10 4 1 3 52 
p_1 = 0.019230769230769232
p_exactly = 0.16147234355824405
p_2 = 0.17649104785295508
result = 0.019079344697167273

Internet research

While working on this problem I did some searching online to try to find similar problems or potential approaches for solutions. Here are some links I found the seemed potentially relevant.

  • Math StackExchange posts with the same question that went unanswered. (1, 2)
  • Resources related to cumulative distribution functions of multinomial distributions.
    • A page from the book "Chemical Process Dynamics and Controls", which includes a section 2.2 that gives a cumulative distribution function for multinomial distributions. (3)
    • A 1981 paper on a representation for a multinomial cumulative distribution function, which I don't really understand yet. (4)

Conclusion

Above are my thoughts on and current attempts at solving this problem. My current best lead is to try something with multinomial distributions, but I haven't been able to solidify and test that approach yet.

Does anyone have thoughts on how to go about solving this problem? Does this approach seem reasonable? Are there better approaches that yield a solution to this problem?

2

There are 2 best solutions below

0
On BEST ANSWER

This can be expressed as a recursive formula.

First, let's look only at $k_1=1$. If $n_1>0$ and $n_2 \geq k_2>0$, then

$$ f(n_1,n_2,1,k_2,p) = n_2 p \cdot f(n_1-1, n_2-1, 1, k_2-1,p) + (1-n_2 p) f(n_1-1, n_2, 1, k_2, p) $$

This is because if the first draw is a wanted card ($n_2 p$), then the conditional probability is the same as for an instance with one less draw ($n_1-1$), one less card type still "interesting" ($n_2-1$), and one less success needed to reach the target ($k_2-1$). Or if the first draw is not a wanted card ($1-n_2 p$), then the conditional probability is the same as for an instance with one less draw ($n_1-1$) but everything else staying the same.

We have initial conditions,

$$ \begin{align*} f(n_1, n_2, 1, 0, p) &= 1 \\ f(0, n_2, 1, k_2, p) &= 0 & \mathrm{if}~k_2>0 \end{align*}$$

since wanting to collect no more cards ($k_2=0$) means the player has already succeeded, and otherwise having no draws left ($n_1=0$) means they've failed.

From the recursion relation, we can see that each set of $f$ values with $n_2-k_2$ constant is an independent problem. We can also consider $p$ constant in the recursion problem. So let's call $m = n_2-k_2$, the maximum number of interesting card types we'd be okay with missing out on. Then there are just two variables to recurse on to get $f(n_1, n_2, 1, n_2-m, p)$.

A computer program can evaluate this kind of recursion fairly efficiently and with whatever accuracy is needed, including exact rational ratios. For $f(10, 4, 1, 3, 1/52)$, this would start like

$$ \begin{align*} f(1, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 \\ f(2, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 + \frac{50}{52} \cdot \frac{2}{52} = \frac{204}{52^2} \\ f(2, 3, 1, 2, 1/52) &= \frac{3}{52} \cdot \frac{2}{52} = \frac{6}{52^2} \\ f(3, 2, 1, 1, 1/52) &= \frac{2}{52} \cdot 1 + \frac{50}{52} \cdot \frac{204}{52^2} \\ f(3, 3, 1, 2, 1/52) &= \frac{3}{52} \cdot \frac{204}{52^2} + \frac{49}{52} \cdot \frac{6}{52^2} \\ f(3, 4, 1, 3, 1/52) &= \frac{4}{52} \cdot \frac{6}{52^2} \end{align*} $$

For $k_1 > 1$, things get trickier. A recursion will need more than the existing $f$ in five variables, since it can matter whether the player is looking for two more identical cards, or one more of each of two card types. Instead, we could introduce a sequence of counter variables $w_1, \ldots w_{n_2}$, representing for each interesting card type the number of cards the player still needs to reach a "success" for that card type: $g(n_1, k_2, p, w_1, \ldots, w_{n_2})$ is the probability that a player will get at least $k_2$ successes within the next $n_1$ draws, where $p$ is the probability of drawing any particular card and $w_i$ more draws of card $i$ counts as a success.

This relates to the original $f$ by taking all $w_i = k_1$. That is,

$$ f(n_1, n_2, k_1, k_2, p) = g(n_1, k_2, p, \underbrace{k_1, k_1, \ldots, k_1}_{n_2 \mathrm{~times}}) $$

Note that the value of $g$ does not change when the values of $w_i$ are permuted. This should drastically cut back on the number of calculations needed when applying the recursion, whether by hand or by computer. It would probably be helpful to keep the $w_i$ in non-descending or non-ascending order.

For initial conditions, again $k_2=0$ is a success and $k_2>0, n_1=0$ is a failure. The recursion is

$$ g(n_1, k_2, p, w_1, \ldots, w_{n_2}) = p \sum_{w_i = 1} g(n_1-1, k_2-1, p, w_1, \ldots, w_{i-1}, 0, w_{i+1}, \ldots, w_{n_2}) \\ {}+ p \sum_{w_i > 1} g(n_1-1, k_2, p, w_1, \ldots, w_{i-1}, w_i - 1, w_{i+1}, \ldots, w_{n_2}) \\ {}+ \left(1 - p \sum_{i=1}^{n_2} w_i\right) g(n_1-1, k_2, p, w_1, \ldots, w_{n_2}) $$

1
On

First, let me say that I think simulation is the best way to do this problem. The reason for this will be apparent as I describe how to do it analytically.

First, let's count the number of ways that we can draw three different Aces. We have $4$ ways to choose the Aces, then $10\cdot9\cdot8$ ways to distribute them among the $10$ draws. The other $7$ cards can be any card at all, so we have $52^7$ possibilities for them. If this were the end of it, we would have a probability of $$\frac{4\cdot10\cdot9\cdot8}{52^3}\approx0.0204824761,$$ but we've double-counted cases where we drew more than $3$ Aces.

All the cases where we drew four Aces, at least $3$ of which are distinct have been counted more that once. We might have drawn each Ace once, or we might have drawn three different Aces, and drawn one of them twice. In the first case, we will have counted the event $4$ times, and in the second case, we will have counted each event twice.

Let's do these. For the cases of all four Aces, we have $$\frac{10\cdot9\cdot8\cdot7}{52^4}\approx0.0006893141$$ for the second case, we have $4$ way to choose the $3$ Aces, then $3$ ways to choose the duplicated Ace, then $\frac{10\cdot9\cdot8\cdot7}2$ ways to distribute the Aces, so a probability $$\frac{10\cdot9\cdot8\cdot7\cdot6}{52^4}\approx0.00413588$$ We need to subtract $3$ times the first of these, and $1$ times the second, giving approximately $0.014278649$.

Now, we have to consider the cases where we drew five Aces. If we do this, we will find that we subtracted some events too many times, and we will have to add these back in. Of course, we have to continue this until we deal with the cases where we drew $10$ Aces. It gets quite tedious, not only because we have to treat all the ways we could have draw $n$ Aces, at least $3$ of which are distinct, but for each of those cases, we have to count how many times it was added or subtracted at the previous stage.

This is the principle of inclusion and exclusion. The various steps do give bounds on the probability though. These are the Bonferroni inequalities. If $p$ is the probability we seek, we can say $$ 0.014278649 < p < 0.0204824761$$ which agrees nicely with your simulation.

Of course, we could go through only some of the stages, and prove precise upper and lower bounds on the probabilities, if it were considered worthwhile. If I had to do that for some reason, I would check my answer by simulation.

It would be an interesting exercise to write a computer program to compute the answer precisely. If you want to try it, I suggest doing it first for the $3$ Aces problem, before trying to generalize it.

EDIT

I thought of what I think will be a better way of computing precise bounds. Instead of alternately overstating and understating the probability, compute successively better lower bounds, until the possible understatement is neglibilible.

The probability that of drawing exactly $3$ distinct Aces, and no other Ace is $$\frac{4\cdot10\cdot9\cdot8\cdot48^7}{52^{10}}\approx0.01169627$$ The probability of drawing exactly $4$ Aces, all of which are distinct is $$\frac{10\cdot9\cdot8\cdot7\cdot48^6}{52^{10}}\approx0.0004264268$$ The probability of drawing exactly $4$ Aces, $3$ of which are distinct is $$\frac{10\cdot9\cdot8\cdot7\cdot6\cdot48^6}{52^{10}}\approx0.00255856$$ Adding gives $0.01468126$ as a lower bound on the probability. How much are we missing we if we quit here? Certainly, less than the probability of drawing at least $5$ Aces. There are $\binom{10}5$ ways to choose which draws will be Aces, and $4$ choices of which Ace will be drawn, then $52$ choices for each of the other spots. This gives $$\frac{4^5\binom{10}5}{52^5}\approx0.0006787$$ so we can say $$0.01468126<p<0.01536$$

This approach would be easier to program than inclusion-exclusion, certainly.