Expected number of tosses for a number to repeat $N$ times given an $n$-sided die.

366 Views Asked by At

I am currently reading a "pop-science" book on statistical fallacies. On page 36 the authors discuss how events can cluster around certain locations by chance. The authors exemplify this by a $6*6$ checkerboard and two dice.

enter image description here

Without further explanation they calculate that a square with $4$ events will happen roughly every second simulation ($\frac{1}{0.54} = 1.85 \approx 2$), where one simulation consists of $36$ rolls of two dice (as far as I understand).

Now I understood that, according to this answer, the expected number of rolls for a number to repeat, given a single fair die, is:

$$\operatorname{E}[r] = \sum_{r=2}^{n+1} r \frac{n!(r-1)}{(n-r+1)!n^r}$$

So given that rolling two dice or one 36-sided die give equal results, I can calculate the number of rolls to see a square with two events. But how do I generalize this to $N$ events? Which I suspect is one way to figure out how many rolls it takes to see a square with $4$ events, right?

Also, according to a footnote, the authors appear to be using the Poisson distribution to calculate this.

I know that the Binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product $np$ remains fixed.

I see how one could e.g. calculate the probability that $4$ events occur in a specific square using both the Binomial and Poisson distribution for a simulation of 36 throws of a 36-sided die:

$${36 \choose 4} \left( \frac{1}{36}\right)^4 \left( \frac{35}{36}\right)^{32} \approx 0.014$$

$$\frac{1^4e^{-1}}{4!} \approx 0.015$$

But I am not sure how the authors used the Poisson distribution to calculate the expected number of simulations of 36 throws necessary to see an unspecified square with 4 events.

Sorry for the long question, I'd appreciate if someone could at least point out some book or other resource where I can learn all this from. Thanks a lot!

2

There are 2 best solutions below

10
On BEST ANSWER

I think the cited book may be trying to state something much simpler than the question and answers (so far) seem to suppose.

One trial of the experiment consists of $36$ instances of selecting at random one of $36$ boxes. The expected number of boxes to be selected exactly $k$ times per trial is then($^*$) $36\ p_k$, where

$$p_k = {36 \choose k} \left( \frac{1}{36}\right)^k \left( \frac{35}{36}\right)^{36-k}\ \ [k\in \{0,1,...,36\}]$$

with corresponding Poisson approximations.

For ease of expression, let's refer to a box selected exactly $k$ times (in one trial) as a "$k$-hit" box.

Now, the above rate of $36\ p_k$ $k$-hit boxes per trial is the same as $\frac{1}{36\ p_k}$ trials per $k$-hit box.

E.g., a rate of $36\ p_4 \approx 36\cdot 0.015 \approx 0.54$ $4$-hit boxes per trial is the same as approximately $\frac{1}{0.54}\approx 1.85\approx 2$ trials per $4$-hit box, just as the cited book states.

NB: The expected number of trials per $k$-hit box is $E(\frac{1}{N_k})$, where $N_k$ is the number of boxes that are hit exactly $k$ times in one trial. For this, $\frac{1}{E(N_k)}$ might not be a very good approximation; however, the latter appears to be what the book uses, perhaps because it can be more easily calculated (without need of simulation). Note that, although similar, this is not the same as the expected number of trials needed to obtain at least one $k$-hit box (which could also be readily computed from the binomial probabilities $p_k$).


($^*$) This can be proved easily by the same method as used here. Note that $X$ (= the maximum number of hits among the boxes, in one trial) is not the quantity of interest; rather, the focus is on $N_k$ (= the number of boxes that are hit exactly $k$ times in one trial). That is, the computation is not about $P(X=k)$, but rather $E(N_k)$. This expectation can be written $$E(N_k) = E(I^{(k)}_1 + I^{(k)}_2 + ... + I^{(k)}_{36}) = 36\ E(I^{(k)}_1) = 36\ p_k$$ where $I^{(k)}_j =1\text{ IF box }j\text{ is hit exactly }k\text{ times ELSE }0$.

NB: The indicator variables $I^{(k)}_j$ are not independent, but the result holds as stated because of the linearity of the expectation operator.

7
On

Comment:

Your analysis for a particular cell seems correct. However, you are looking for $P(X = 4)$, where $X$ is the maximum number of hits in any of 36 cells when the two-dice experiment is repeated 36 times. In the spirit of the problem, it may make more sense to look for $P(X \ge 4).$

If it helps toward an analytic solution, I conclude from a simulation in R that the author's claim is wrong. The table at the end of the following simulation gives the approximate distribution of $X$ correct to about 2 places--enough to see that $P(X = 4)$ is nearer to 0.39 than to 0.54. The author also claims that $P(X = 5) = 0.11$, while my simulation has $P(X = 5) \approx 0.09.$

However, the author is correct in his general message that such 'clusters' are more likely than one might guess from intuition. And one gets four or more hits about half the time.

 m=10^5; x = numeric(m)
 for(i in 1:m){
   x[i]= max(rle( sort(sample(1:36,36,repl=T)))$length)  }
 mean(x==4);  mean(x >= 4)
 ## 0.39411
 ## 0.49928
 round(table(x)/m, 3)
 ## b
 ##     2     3     4     5     6     7     8     9 
 ## 0.018 0.482 0.394 0.090 0.014 0.002 0.000 0.000 

Note on the simulation: Cells are numbered 1 through 36. The R function rle is for 'run length encoding'. Finding lengths of runs in $sorted$ data is an easy (if perhaps not optimally efficient) way to find numbers of repeated hits.