Upper bound of probability of not getting all values in independent draws


We draw $m$ uniform independent random values among $n$, with $m\ge n$. We consider the probability $p(m,n)$ that not all $n$ values have been drawn. We want an upper bound within a constant factor.

The probability that a given value was not drawn is $q(m,n)=(1-1/n)^m$. If these events were independent, we could use $1-(1-q(m,n))^n$ for $p(m,n)$. However, the events are dependent, so that expression is not a rigorous bound.

How can we rigorously derive an upper bound? It need not be tight; I'd be happy with anything within a constant factor.


The question's motivation is to prove that $\displaystyle\lim_{n\to+\infty}p(n\left\lceil\log_2(n)\right\rceil,n)=0$, if possible with an explicit upper bound (towards answering an often-asked question about cryptographic hashes reaching their whole stated destination set, under a random oracle model).

This itself is plausible because (as studied in the coupon collector's problem) the expected number of draws to get all $n$ values is $n(\ln(n)+\gamma)+\frac12+\mathcal O(\frac1n)$, and $n\left\lceil\log_2(n)\right\rceil$ grows faster than that, by a factor $\frac1{\ln(2)}>1$.
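The coupon-collector asymptotic above is easy to sanity-check numerically: the exact expectation is $nH_n$, which can be compared with $n(\ln(n)+\gamma)+\frac12$ and with simulation. A rough Python sketch (the trial count and seed are arbitrary choices):

```python
import math
import random

def expected_draws_simulated(n, trials=2000, seed=1):
    """Monte Carlo estimate of the expected number of uniform draws
    from n values needed to see every value at least once."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(rng.randrange(n))
            draws += 1
        total += draws
    return total / trials

n = 100
exact = n * sum(1 / k for k in range(1, n + 1))  # n * H_n
approx = n * (math.log(n) + 0.5772156649) + 0.5  # n(ln n + gamma) + 1/2
print(exact, approx, expected_draws_simulated(n))
```

For $n=100$ all three values land near $519$, and the closed form and the asymptotic agree to within the stated $\mathcal O(\frac1n)$ term.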

There are 3 answers below.

Best answer

Define the indicator variable $X_{ij}=\mathbf{1}[\text{ball } j \text{ falls in bin } i]$ with $Y_i=\sum_{j=1}^m X_{ij}$ the load in bin $i.$ This is a sum of independent $X_{ij}$'s.

The expected load of bin $i$ is $\mu=m/n.$ The lower tail of the Chernoff bound is $$ \mathbb{P}[Y_i\leq (1-\delta)\mu]< \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^\mu. $$ Choose $\delta$ such that, say, $(1-\delta)\mu=0.99$ (i.e., $\delta=1-\frac{0.99}{\mu}$), which means that $$ \mathbb{P}[Y_i\leq (1-\delta)\mu]=\mathbb{P}[Y_i\leq 0.99]= \mathbb{P}[Y_i=0]<\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^\mu\quad (*). $$ Now $m=\lceil \lg n \rceil n,$ so $\mu=\lceil \lg n\rceil$ where $\lg(\cdot)$ denotes $\log_2(\cdot).$ Taking logarithms, equation $(*)$ becomes $$ \mathbb{P}[Y_i=0]< \exp\left(-\mu+0.99+0.99\ln\left(\frac{\mu}{0.99}\right)\right), $$ where the last two terms are $O(\ln\ln n)=o(\ln n).$ Ignoring the ceiling in $\lceil\lg n\rceil$ yields $$ \mathbb{P}[Y_i=0]<\exp\left(-1.4426\ln n+o(\ln n)\right)=n^{-1.4426+o(1)}. $$ Now apply the union bound over the $n$ bins to get $$ \mathbb{P}[\text{some bin empty}]=\mathbb{P}[\exists i: Y_i=0]<n^{-0.4426+o(1)}\to 0. $$ (In fact, for this particular event the Chernoff machinery is not needed: $\mathbb{P}[Y_i=0]=(1-1/n)^m\leq e^{-m/n}=e^{-\mu}\leq n^{-1/\ln 2},$ with $1/\ln 2\approx 1.4427,$ directly.)
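The conclusion can be checked numerically against the exact probability that some bin is empty, computed by inclusion-exclusion. A Python sketch (the choice $n=256$ is arbitrary; being a power of two makes $\lceil\lg n\rceil=\lg n=8$ exact):

```python
import math
from math import comb

def p_some_bin_empty(m, n):
    """Exact P[some bin empty after m balls into n bins], by
    inclusion-exclusion with exact integer arithmetic."""
    num = sum((-1) ** (j - 1) * comb(n, j) * (n - j) ** m
              for j in range(1, n + 1))
    return num / n ** m

n = 256                           # power of two, so ceil(lg n) = lg n = 8
m = n * math.ceil(math.log2(n))   # m = 2048
exact = p_some_bin_empty(m, n)
union = n * (1 - 1 / n) ** m      # union bound: n * P[a given bin empty]
print(exact, union, n ** (1 - 1 / math.log(2)))
```

All three values come out below $0.09$, with the exact probability below the union bound, which is in turn below $n^{1-1/\ln 2}$.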


Theorem: (Theorem 5.13 in Probability and Computing, Mitzenmacher and Upfal)

Let $X$ be the number of coupons observed before obtaining one of each of $n$ coupons. Then for any constant $c$ $$ \lim_{n\rightarrow \infty} \mathbb{P}[X>n \ln n+ c n]= 1-e^{-e^{-c}}. $$

Already for $c=5,$ the right hand side is as small as $0.007.$

Proof outline: The Poisson approximation becomes exact in the limit $n\rightarrow \infty.$ So assume the number of balls in each bin is a Poisson r.v. with mean $\mu=\ln n+c,$ which gives that the probability that a specific bin is empty is $$ \frac{e^{-c}}{n}, $$ and since the bin loads are independent under the Poisson approximation, the probability that no bin is empty is $$ \left(1-\frac{e^{-c}}{n}\right)^n \approx \exp(-e^{-c}). $$

The proof is delicate and proceeds by splitting the event (where dependence on $n$ is suppressed in the notation) $$ {\cal E}=\{\text{no bin is empty}\} $$ as $$ \mathbb{P}({\cal E})=\mathbb{P}({\cal E} \cap (|X-\mu|\leq \sqrt{2m \ln n}))+ \mathbb{P}({\cal E} \cap (|X-\mu|> \sqrt{2m \ln n}))\quad (1) $$ and showing that:

  1. $\mathbb{P}(|X-\mu|> \sqrt{2m \ln n})$ is $o(1).$

  2. $\left| \mathbb{P}({\cal E}~|~|X-\mu|\leq \sqrt{2m \ln n})-\mathbb{P}({\cal E}~|~X=m)\right|$ is $o(1).$

  3. Point 1 guarantees that the second term on the right of $(1)$ is $o(1).$

  4. Point 2 guarantees that the difference between the experiment resulting in exactly $m$ vs nearly $m$ balls makes an asymptotically negligible difference in the probability that every bin has a ball.

Complements: Over 99.3 percent of the time the number of coupons required lies in $[n \ln n - 5n, n \ln n+ 5 n]$.
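Since $\{X>m\}$ is exactly the event that $m$ draws miss some coupon, the limit in the theorem can be compared against the exact inclusion-exclusion probability at finite $n$. A sketch ($n=500$ and $c=0$ are arbitrary choices; at $c=0$ the limit is $1-e^{-1}\approx 0.632$):

```python
import math
from math import comb

def p_miss(m, n):
    """Exact P[m uniform draws from n values miss at least one value];
    this equals P[X > m] for the coupon-collector time X."""
    return sum((-1) ** (j - 1) * comb(n, j) * (n - j) ** m
               for j in range(1, n + 1)) / n ** m

n, c = 500, 0
m = math.ceil(n * math.log(n) + c * n)
limit = 1 - math.exp(-math.exp(-c))  # Theorem 5.13 limit, ~0.632 at c=0
print(p_miss(m, n), limit)
```

Already at $n=500$ the exact value agrees with the limit to about two decimal places.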

The relationship below may be of interest as well (Theorem 5.7 from the same book) for problems of this type involving balls and bins.

Theorem: Let $m$ balls be thrown uniformly and independently into $n$ bins. Let $f(x_1,\ldots,x_n)$ be a nonnegative function; then $$ \mathbb{E}[f(X_1^{(m)},\ldots,X_n^{(m)})] \leq e \sqrt{m}\, \mathbb{E}[f(Y_1^{(m)},\ldots,Y_n^{(m)})], $$ where $X_i^{(m)}$ is the number of balls in the $i^{th}$ bin, for $1\leq i\leq n,$ and where $Y_i^{(m)}$ are i.i.d. Poisson with mean $m/n.$

Another answer

By the standard urns and balls interpretation, we have $$p(m,n)=1-\frac{n! \, S_{m,n}}{n^m}$$

where $S_{m,n}$ is the Stirling number of the second kind. The question is then equivalent to asking for a lower bound on $S_{m,n}$. There are many, but don't expect anything very neat... or practical.
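The identity $p(m,n)=1-n!\,S_{m,n}/n^m$ is easy to verify in code: $n!\,S_{m,n}$ counts the surjections from $m$ draws onto $n$ values and agrees exactly with the inclusion-exclusion count. A sketch using the standard recurrence (the values $m=50$, $n=10$ are arbitrary):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(m, n):
    """Stirling number of the second kind via the recurrence
    S(m, n) = n*S(m-1, n) + S(m-1, n-1)."""
    if m == n:
        return 1
    if n == 0 or n > m:
        return 0
    return n * stirling2(m - 1, n) + stirling2(m - 1, n - 1)

m, n = 50, 10
surjections = factorial(n) * stirling2(m, n)  # n! * S(m, n)
p = 1 - surjections / n ** m
# cross-check: inclusion-exclusion count of surjections
alt = sum((-1) ** j * comb(n, j) * (n - j) ** m for j in range(n + 1))
print(p, surjections == alt)
```

The two counts match as exact integers, so the two expressions for $p(m,n)$ are identical, not merely close.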

Added: Still not an answer, but another plausibility argument. The numbers of balls in the urns can be approximated (Poissonization) by $n$ i.i.d. Poisson variables with $\lambda=m/n$. Then the probability that all urns are occupied is

$$ 1-p(m,n) \approx \left( 1 - \exp(- \lambda)\right)^n$$

If we take $m = a\, n \log(n)$ and assume that the approximation is asymptotically precise, then for large $n$ we get

$$ 1-p(m,n) \approx \exp(- n^{1-a}) \to \begin{cases} 0 & a<1 \\ 1 & a>1\end{cases}$$
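Under this Poissonization heuristic the dichotomy is easy to see numerically. A minimal sketch ($n=10^6$ and the exponents $a=0.5,\ 1.5$ are arbitrary choices on either side of the threshold):

```python
def coverage_poisson(n, a):
    """Poissonization estimate of P[all n values seen] after
    m = a*n*log(n) draws: (1 - exp(-a*log n))^n = (1 - n^(-a))^n."""
    return (1 - n ** (-a)) ** n

n = 10 ** 6
print(coverage_poisson(n, 0.5))  # ~exp(-sqrt(n)): essentially 0
print(coverage_poisson(n, 1.5))  # ~exp(-n^(-1/2)): close to 1
```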

Another answer

Approximation of the Probability

The probability that $j$ specified values were all missed, summed over the ways to choose them, is $$ N_j=\overbrace{\quad\binom{n}{j}\quad}^{\substack{\text{number of}\\\text{ways to choose}\\\text{the $j$ missed}}}\overbrace{\left(1-\frac jn\right)^m\vphantom{\binom{n}{j}}}^{\substack{\text{probability of}\\\text{missing those}\\\text{$j$ values}}}\tag1 $$ The Principle of Inclusion-Exclusion says that the probability of missing at least one value is $$ p(m,n)=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\left(1-\frac jn\right)^m\tag2 $$ Therefore, $$ \begin{align} p(n\log_2(n),n) &=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\left(1-\frac jn\right)^{n\log_2(n)}\tag{3a}\\ &\sim\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}e^{-j\log_2(n)}\tag{3b}\\[3pt] &=\left(1-\left(1-e^{-\log_2(n)}\right)^n\right)\tag{3c}\\[9pt] &=\left(1-\left(1-\tfrac1{n^{1/\hspace{-1pt}\log(2)}}\right)^n\right)\tag{3d}\\[9pt] &\sim n^{1-1/\hspace{-1pt}\log(2)}\tag{3e}\\[12pt] &\to0\tag{3f} \end{align} $$ Explanation:
$\text{(3a)}$: apply $(2)$
$\text{(3b)}$: Poisson approximation
$\text{(3c)}$: Binomial Theorem
$\text{(3d)}$: $e^{-\log_2(n)}=\frac1{n^{1/\hspace{-1pt}\log(2)}}$
$\text{(3e)}$: Binomial approximation
$\text{(3f)}$: $1-1/\log(2)\approx-0.442695$


Comparison of Values $$ \begin{array}{c|c|c} n&p(n\log_2(n),n)&n^{1-1/\hspace{-1pt}\log(2)}\\\hline 10&0.275663&0.360832\\ 100&0.118840&0.130200\\ 1000&0.045682&0.046980\\ 10000&0.016798&0.016952 \end{array} $$
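The table can be reproduced by evaluating sum $(2)$ with the real exponent $m=n\log_2(n)$; the alternating terms decay fast, so the sum can be truncated early. A Python sketch using log-gamma for the binomials to avoid overflow at large $n$ (the truncation tolerance is an arbitrary choice):

```python
import math

def p_miss_real(m, n, tol=1e-18):
    """Inclusion-exclusion sum (2) with a real exponent m, truncated
    once the (rapidly decaying) terms become negligible."""
    total = 0.0
    for j in range(1, n):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1) + m * math.log(1 - j / n))
        term = math.exp(log_term)
        total += term if j % 2 == 1 else -term
        if term < tol:
            break
    return total

for n in (10, 100, 1000, 10000):
    print(n, p_miss_real(n * math.log2(n), n), n ** (1 - 1 / math.log(2)))
```

The printed values match the two columns of the table to the precision shown.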