X-sided dice probability

137 Views Asked by At

I have an $x$-sided dice and I roll it $y$ times.

What is the probability that I will see at least $z$ of the $x$ sides?

e.g. If I rolled a $6$-sided dice $8$ times, what is the probability I see at least $4$ of the sides?

But I want to do with much bigger numbers

Specifically, I want to know what is probability I see at least $z$ of the $160,000$ sides if I rolled a $160,000$-sided dice $25,000$ times.

Thanks in advance!

3

There are 3 best solutions below

3
On

The chance of a side missing out is $$\left({x-1\over x} \right)^y$$
The chance of two sides missing out is $$\left({x-2\over x}\right)^y$$ The usual approximation by an exponential causes a large error.
I will use the binomial distribution.
On average, you will see $$\mu\approx x(1-e^{-y/x})$$ different sides. The expected value of the square is $$x(1-\left({x-1\over x}\right)^y)+ \\ (x^2-x)(1-2\left({x-1\over x}\right)^y+\left({x-2\over x}\right)^y) $$ So the variance is $$\sigma^2= x(\left({x-1\over x}\right)^y-\left({x-2\over x}\right)^y \\ +x^2(\left({x-2\over x}\right)^y-\left({x-1\over x}\right)^{2y})$$ This is about 1500 for your numbers, compared with 19000 when I neglected the x^2 term. As I mention in comments, this is close to $$\sigma^2\approx xe^{-y/z}-(x+y)e^{-2y/x}$$ For large $x$ and $y$, it will be very close to a normal distribution with that mean and variance. Sixty-eight percent of the time $z$ will be between $\mu-\sigma$ and $\mu+\sigma$ (note $\sigma$ is the square-root of $\sigma^2$ defined above), and ninety-five percent of the time it will be between $\mu-2\sigma$ and $\mu+2\sigma$.

1
On

Possible practical approach (without necessarily needing to approximate):

Define $P(a,b)$ to be the probability of seeing at least $z$ distinct sides after $a$ additional rolls having seen already $b$ distinct sides. Then we have a simple recurrence $$P(a,b)=\frac{b}{x}P(a-1,b)+(1-\frac{b}{x})P(a-1,b+1)$$ Base cases: $P(a,b)=0$ if $a+b<z$ and $P(a,b)=1$ if $b \geq z$.

You want $P(y,0)$.

This is around three hundred million subproblems for $y=25000$.

For example, for $z = 23145$, we get about $0.504$ (using double-precision floating point with no concern for numerical issues, took maybe a minute to compute by computer).

($23145$ chosen as an "interesting" value of $z$ using the coupon collector problem to calculate how many distinct coupons to collect from $160000$ to give expected number of draws of about $25000$, ie $160000 (H_{160000}-H_{160000-23145}) \approx 25000$ where $H_n$ is the $n$th harmonic number; can also get $23145$ from Empy2's answer.)

0
On

There is an exact solution involving Stirling numbers of the second kind and if you wanted to do the calculations for throwing eight $6$-sided dice then the probabilities would be:

  • $1$ face: $\left\{{8\atop 1}\right\}\frac{6!}{6^8 (6-1)!} = \frac{6}{1679616}$
  • $2$ faces: $\left\{{8\atop 2}\right\}\frac{6!}{6^8 (6-2)!} = \frac{3810}{1679616}$
  • $3$ faces: $\left\{{8\atop 3}\right\}\frac{6!}{6^8 (6-3)!} = \frac{115920}{1679616}$
  • $4$ faces: $\left\{{8\atop 4}\right\}\frac{6!}{6^8 (6-4)!} = \frac{612360}{1679616}$
  • $5$ faces: $\left\{{8\atop 5}\right\}\frac{6!}{6^8 (6-5)!} = \frac{756000}{1679616}$
  • $6$ faces: $\left\{{8\atop 6}\right\}\frac{6!}{6^8 (6-6)!} = \frac{191520}{1679616}$

making the probability of $4$ or more faces $\frac{1559880}{1679616} \approx 0.9287$

This can become intractable for large $x$ and $y$. An alternative approach is a recurrence with $$p(x,y,z) = \frac{z}{x}p(x,y-1,z) + \frac{x-z+1}{x}p(x,y-1,z-1)$$ starting with $p(x,1,1)=1$ and $p(x,1,z)=0$ for $z \not=1$.

Suppose we start with your $x=160000$ and $y=25000$ and suppose we want to find for example $\mathbb P(Z \ge 23100)=\sum\limits_{z=23100}^{160000} p(x,y,z)$. Then we could use the following R code:

x <- 160000
maxy <- 25000
atleastz <- 23100
p <- c(1, rep(0, x-1))
for (y in 2:maxy){ p <- (1:x)/x * p + (x-(1:x)+1)/x * c(0,p[-x]) }
sum(p[atleastz:x])
# 0.8783097

Even that takes a little time with large $x$ and $y$. We could instead use a normal approximation (and a continuity correction) starting from the exact results $$\mu=E[Z] = x \left(1- \left(1-\frac1x\right)^y\right)$$ $$\sigma^2=\mathrm{Var}(Z)=x\left(1-\frac1x\right)^y + \left(x^2-x\right)\left(1-\frac2x\right)^y - x^2\left(1-\frac1x\right)^{2y}$$ which with your $x=160000$ and $y=25000$ are a mean of $23144.81$ and variance of $1506.32$ (i.e. a standard deviation of $38.81$).

We then want $\mathbb P(Z \ge z)\approx 1-\Phi\left(\frac{z-\frac12 - \mu}{\sigma}\right)$. With R this might suggest

x <- 160000
y <- 25000
atleastz <- 23100
m <- x*(1-(1-1/x)^y)
v <- x*(1-1/x)^y + (x^2-x)*(1-2/x)^y - x^2*(1-1/x)^(2*y)
1 - pnorm((atleastz - 1/2 - m)/sqrt(v))
# 0.8785072 

which in this example is very close to the exact result.