Asymptotics of the classical occupancy problem

511 Views Asked by At

Classical Occupancy Problem. There are $n$ distinct labeled balls in an urn. $k$ of them of uniformly selected with replacement. What is the probability that the sample contains at least one ball of each kind.

Let $(M_1,\ldots, M_n)$ be the vector of counts of each kind in the sample. The vector $(M_1,\ldots, M_n)$ follows multinomial distribution $\operatorname{Mult}\left(k, \{\frac{1}{n},\ldots, \frac{1}{n}\}\right)$, and the probability is $$ p_{k,n} = \Pr(M_1 > 0, \ldots, M_n >0) \tag{1} $$ It can be explicitly computed either using inclusion-exclusion principle, or by building a recurrence relation for $p_{k,n}$: $$ p_{k,n} = \sum_{m=0}^n (-1)^m \binom{n}{m} \left(1-\frac{m}{n}\right)^k = \frac{n!}{n^k} \mathcal{S}_2\left(k, n\right) \tag{2} $$ where $\mathcal{S}_2\left(k, n\right)$ denotes Stirling number of the second kind.

See Brian Gladman's notebook from mersseneforum.org or section 5.6 of Problems and snapshots from the world of probability for more details.

For $k \ggg n$, the probability $p_{k,n}$ is very close to 1 (see DMLF for asymptotic behavior of the Stirling number). I am interested in obtaining more precise asymptotic expansion, perhaps by using CLT to evaluate $(1)$.

References or explicit solutions are appreciated.

1

There are 1 best solutions below

2
On BEST ANSWER

First: the statement is more usually formulated as throwing $k$ balls into $n$ urns, and computing the probability that no urn is empty.

An asymptotic can be obtained by a Poissonization argument: for large $n,k$ the process is asympotically equivalent to throwing $n$ iid Poisson variables with $\lambda = k/n$ [*]. The probability that all are strictly positive is, in this approximation:

$$P(n,k) \approx (1- e^{-\lambda})^n =\left(1-\exp\left(-\frac{k}{n}\right)\right)^n$$

[*] More precisely: if $X=\{x_1, x_2 \cdots x_n\}$ is a symmetric multinomial with $\sum x_i=k$, and $Y=\{y_1, y_2 \cdots y_n\}$ is iid Poisson with $E(y_i)=\lambda = k/n$, then $P(Y | \sum y_i = k ) =P(X)$

Some values of $1-P$

  n    k     exact       approx        approx2
 10   30  0.370862811  0.399919706   0.391737310
 10   45  0.085332428  0.105697883   0.090006050
 10   67  0.008580587  0.012241161   0.008799062
 10  100  0.000265605  0.000453907   0.000249721
 10  150  0.000001369  0.000003059   0.000000918
 20   60  0.639394774  0.639903641   0.651552400
 20   90  0.183791028  0.200223724   0.188593703
 20  135  0.019540042  0.023158931   0.019936973
 20  202  0.000632592  0.000821271   0.000634614
 20  303  0.000003559  0.000005266   0.000003403
 30   90  0.793550984  0.783913271   0.800389319
 30  135  0.272036304  0.284758383   0.276500488
 30  202  0.031460596  0.035106950   0.031928138
 30  303  0.001037105  0.001231653   0.001045158
 30  454  0.000006205  0.000008031   0.000006140
 40  120  0.881840513  0.870330612   0.885651604
 40  180  0.350826354  0.360357908   0.354883558
 40  270  0.042236862  0.045781526   0.042723586
 40  405  0.001408318  0.001601360   0.001419115
 40  607  0.000008470  0.000010272   0.000008452
 50  150  0.932380090  0.922187956   0.934494710
 50  225  0.421119127  0.427966723   0.424774699
 50  337  0.053946062  0.057450981   0.054452003
 50  505  0.001852757  0.002051912   0.001865740
 50  757  0.000011405  0.000013297   0.000011417
 60  180  0.961304919  0.953306526   0.962474829
 60  270  0.483814133  0.488429430   0.487093918
 60  405  0.064472541  0.067880206   0.064980408
 60  607  0.002223894  0.002421147   0.002237862
 60  910  0.000013672  0.000015536   0.000013702
 70  210  0.977857699  0.971980166   0.978503440
 70  315  0.539725936  0.542501356   0.542661548
 70  472  0.075928083  0.079278633   0.076442829
 70  708  0.002632424  0.002830824   0.002647309
 70 1062  0.000016170  0.000018040   0.000016214
 80  240  0.987329875  0.983185850   0.987685543
 80  360  0.589585377  0.590857994   0.592209048
 80  540  0.086207817  0.089467104   0.086719755
 80  810  0.003003158  0.003200157   0.003018527
 80 1215  0.000018436  0.000020288   0.000018489
 90  270  0.992750081  0.989910160   0.992945576
 90  405  0.634046052  0.634103438   0.636388630
 90  607  0.097405383  0.100601135   0.097919297
 90  910  0.003449877  0.003649706   0.003465990
 90 1365  0.000021408  0.000023304   0.000021470

The additional column (approx2) is a second order approximation, with a CLT-like correction to the above formula:

$$P(Y>0 \mid \sum y=k)=\frac{P(\sum y_i=k\mid Y>0)}{P(\sum y_i=k)} P(Y>0)$$

The first approximation assumes the fraction is one (because of the law of large numbers, one expects that the conditioning is asympotically irrelevant), while the second approximation computes the fraction using (a zero order approximation) for the CLT, with a Poisson in the denominator and a truncated Poisson in the numerator. Higher order approximations could be obtained via Edgeworth expansions, but they would quickly become clumsy, as higher order moments of a truncated Poisson are involved.