Amount of distinct numbers in a sequence of $k$ random numbers in range $[1,\ldots,n]$

169 Views Asked by At

Let $D$ be the amount of distinct numbers in a sequence of $k$ random numbers in range $[1,\ldots,n]$ (n>k). I want to show that: $D=\Omega(k)$ with exponential high probability. I'm interested in the case where $k/n \rightarrow 0$, e.g., $k=\sqrt{n}$.

I tried the following approach:

$D$ is the amount of nonempty bins when throwing $k$ balls into $n$ bins. $Y$ is the amount of empty bins when throwing $k$ balls into $n$ bins. So, $D=n-Y$.

Now, using Chernoff bound and the fact that $\text{E}[Y]=ne^{\frac{-k}{n}}$:

$$\begin{align}\Pr(Y>(1+\delta)\text{E}[Y])\le 2^{-\delta \text{E}[Y]}\\ \Pr(D<n-(1+\delta)\text{E}[Y])\le 2^{-\delta \text{E}[Y]}\end{align}$$

where $\delta>2e-1$.

I want to find $\Pr(D<\beta k)$, but to do so I have to take $\delta=\frac{n-\beta k-\text{E}[Y]}{\text{E}[Y]}$ which is negative since $\text{E}[Y]\rightarrow n$.

What I'm doing wrong, or there is some other way to show that $D=\Omega(k)$ with exponential high probability? Or maybe this statement is wrong?

Thanks!

1

There are 1 best solutions below

13
On BEST ANSWER

so I have to take $\delta=\frac{n-\beta k-\text{E}[Y]}{\text{E}[Y]}$ which is negative since $\text{E}[Y]\rightarrow\infty$ (first version of the question), or $\text{E}[Y]\rightarrow n$ (revised version of the question).

Not quite...

  • First, the mean of $Y$ is not what you write but $\mathbb E(Y)=n\left(1-\frac1n\right)^k$.
  • Second, $\delta$ has the sign of the numerator divided by $n$, which is $$ \gamma=1-\beta\frac{k}n-\left(1-\frac1n\right)^k. $$ For every fixed $\beta\lt1$, when $n\to\infty$ with $k\ll n$, one gets $$ \gamma=1-\beta\frac{k}n-\left(1-\frac{k}n+o\left(\frac{k}n\right)\right)\sim(1-\beta)\frac{k}n, $$ hence $\gamma\gt0$ for every $n$ large enough.

Edit: One can control the distribution of $D$ as follows. Consider the event $[D\leqslant k-i]$ and choose the $k$ numbers sequentially. Then $D\leqslant k-i$ means that at least $i$ times, one chooses an already chosen number. This happens each time with probability $\leqslant\frac{k}n$ and there are $\binom{k}i\leqslant2^k$ ways of choosing these $i$ moments, hence $\mathbb P(D\leqslant k-i)\leqslant2^k\left(\frac{k}n\right)^i$ $(\ast)$.

For example, $\mathbb P(D\leqslant\frac12k)\leqslant\mathrm e^{-\Lambda(n,k)}$ with $\Lambda(n,k)=-\frac12k\log\left(\frac{k}n\right)-k\log2$. One sees that, if $1\ll k\ll n$, then $\Lambda(n,k)\gg k$ hence the control on $\mathbb P(D\leqslant\frac12k)$ is at least exponential in $k$.

Edit 2: To see that $(\ast)$ holds, for every subset $I$ of $\{1,2,\ldots,k\}$ of size $i$, call $A_I$ the event that at every time in $I$ one chooses an already chosen number and at every time not in $I$ but earlier than some element in $I$ one chooses a new number (at every time not in $I$ later than the whole set $I$, the choice is free). Then the events $A_I$ are disjoint and their union over every $I$ of size $i$ included in $\{1,2,\ldots,k\}$ is $[D\leqslant k-i]$. There are exactly ${k\choose i}\leqslant2^k$ such sets $I$, and for every such set $I$, $\mathbb P(A_I)\leqslant\left(\frac{k}n\right)^i$. The upper bound of $\mathbb P(D\leqslant k-i)$ follows.