Use of Chebyshev's inequality in the coupon collector problem


From the note: https://terrytao.wordpress.com/2015/10/23/275a-notes-3-the-weak-and-strong-law-of-large-numbers/comment-page-1/#comment-682897.

In the coupon collector problem, one considers an infinite sequence of “coupons” that are iid and uniformly distributed on the finite set ${\{1,\dots,N\}}$, and lets ${T_N}$ denote the first time at which all ${N}$ different types of coupon have been collected. One can obtain the bounds $\displaystyle {\bf E} T_N = N \log N + O(N)$ and $\displaystyle {\bf Var} T_N = O(N^2)$. It is then stated in the note that, from Chebyshev’s inequality, we thus see:
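(Not part of the note, but as a quick sanity check of these two bounds: $T_N$ is a sum of $N$ independent geometric waiting times, which gives the mean and variance exactly. The following sketch computes them and compares against $N\log N$ and $N^2$.)

```python
import math

def coupon_stats(N):
    """Exact E[T_N] and Var[T_N]: T_N is a sum of N independent geometric
    waiting times, where the (k+1)-st new coupon arrives with p = (N-k)/N."""
    mean, var = 0.0, 0.0
    for k in range(N):
        p = (N - k) / N
        mean += 1 / p              # E[Geometric(p)] = 1/p
        var += (1 - p) / p**2      # Var[Geometric(p)] = (1-p)/p^2
    return mean, var

for N in (10, 100, 1000):
    mean, var = coupon_stats(N)
    # (mean - N log N)/N stays bounded (it tends to Euler's gamma),
    # and var/N^2 stays below pi^2/6, matching E T_N = N log N + O(N)
    # and Var T_N = O(N^2).
    print(N, (mean - N * math.log(N)) / N, var / N**2)
```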

$\displaystyle {\bf P}( |T_N - N \log N| \geq \lambda N ) = O( \lambda^{-2} )$ for any ${\lambda > 0}$.

Question: Applying Chebyshev's inequality with the given bounds seems to give $\displaystyle {\bf P}( |T_N - N \log N - O(N)| \geq \lambda N ) = O( \lambda^{-2} )$, so why can we disregard the $O(N)$ term in ${\bf E} T_N$ for this estimate?

Best answer:

Chebyshev's inequality says that $P(|X-\mu_X|\ge t) \le \frac{\sigma_X^2}{t^2}$, so by the triangle inequality we also have $P(|X-\mu_X + s|\ge t+|s|) \le \frac{\sigma_X^2}{t^2}$ for any constant $s$.

(Here $\mu_{T_N}=N H_N$, where $H_N$ is the $N$-th harmonic number; this is close to, and bounded above by, $N\log_e(N)+ \gamma N +\frac12$, but we will only use an upper bound of the form $N\log_e(N) +c_1 N$. Similarly, $\sigma_{T_N}^2$ is close to, and bounded above by, $\frac{\pi^2}{6}N^2$, though we will only use an upper bound $c_2N^2$.)

So let $X=T_N$ and write $\mu_{T_N} = N\log_e(N)+s$, where $0 \le s \le c_1 N$. If $\lambda \le 2c_1$, the claim is trivial, since any probability is bounded above by $1 \le 4c_1^2\,\lambda^{-2}$. If $\lambda > 2c_1$, take $t = \lambda N - s \ge (\lambda - c_1)N \ge \frac{\lambda}{2} N$; then $P(|T_N - N\log_e(N)|\ge \lambda N) = P(|X-\mu_X + s|\ge t+s) \le \frac{\sigma_X^2}{t^2} \le \frac{c_2N^2}{(\lambda N/2)^2} = 4c_2 \lambda^{-2}$. In either case the probability is $O(\lambda^{-2})$, as desired, and no restriction on $N$ is needed.
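To get a feel for how tight this is in practice, here is a small Monte Carlo sketch (the sample size and seed are arbitrary choices of mine, not from the answer) estimating ${\bf P}(|T_{100} - 100\log 100| \ge \lambda \cdot 100)$ for a few values of $\lambda$:

```python
import math
import random

random.seed(0)  # arbitrary seed, for reproducibility

def sample_T(N):
    """One draw of T_N: draw uniform coupons until all N types are seen."""
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(random.randrange(N))
        t += 1
    return t

N, trials = 100, 5000
center = N * math.log(N)
samples = [sample_T(N) for _ in range(trials)]
for lam in (1, 2, 4):
    # Empirical estimate of P(|T_N - N log N| >= lam * N)
    p = sum(abs(x - center) >= lam * N for x in samples) / trials
    print(f"lambda={lam}: empirical tail probability {p:.4f}")
```

The empirical tail probabilities drop off much faster than $\lambda^{-2}$, in line with the remarks on the charts below.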


It is not necessary for the argument, but note that the mode of the distribution of $T_N$ is close to $N\log_e(N)$, while, since $T_N$ has a right-skewed distribution, its mean $N H_N$ is larger. So for small $\lambda$, you are more likely to be outside a narrow interval around $N H_N$ than outside an interval of the same width around $N\log_e(N)$.

The probability mass function for $T_{100}$ looks something like this, with a blue vertical line indicating $N\log_e(N) \approx 460.5$ and a red line indicating the mean $N H_N \approx 518.7$:
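(For reference, a chart like this can be computed exactly by inclusion–exclusion over the set of coupons still missing after $t$ draws, ${\bf P}(T_N \le t) = \sum_{j=0}^{N} (-1)^j \binom{N}{j} (1 - j/N)^t$. A sketch, with function names of my own choosing:)

```python
from math import comb, floor, ceil, log

N = 100

def cdf_T(t, N=N):
    """P(T_N <= t), by inclusion-exclusion over which coupons are still
    missing after t draws. (The alternating sum is numerically fine for t
    near N log N, where the individual terms are of moderate size.)"""
    return sum((-1)**j * comb(N, j) * (1 - j / N)**t for j in range(N + 1))

def prob_outside(center, lam, N=N):
    """P(|T_N - center| >= lam * N), computed from the exact cdf."""
    left = cdf_T(floor(center - lam * N))            # P(T_N <= center - lam*N)
    right = 1 - cdf_T(ceil(center + lam * N) - 1)    # P(T_N >= center + lam*N)
    return left + right

center = N * log(N)                              # N log N, about 460.5
mean = N * sum(1 / i for i in range(1, N + 1))   # N H_N, about 518.7

for lam in (0.1, 0.5, 1, 2):
    print(lam, prob_outside(center, lam), prob_outside(mean, lam))
```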

[figure: probability mass function of $T_{100}$]

Here is an illustration of what I think the probabilities of being outside the $\pm \lambda N$ intervals are for $0 \le \lambda \le 8$ when $N=100$, with the blue line for the intervals around $N\log_e(N)$ and the red line for the intervals around $N H_N$. The blue line being below the red line on the left is due to $N\log_e(N)$ being near the mode; the red line being below the blue line on the right is due to $N H_N$ being closer to the heavier right tail of the distribution.

[figure: probability of falling outside the $\pm \lambda N$ intervals]

The same chart, but with the probabilities on a log scale, suggests that $O\left(\lambda^{-2}\right)$ is conservative, and the visually almost straight lines suggest that a bound of the form $O\left(e^{-\lambda}\right)$ might be possible. This improvement should not really be a surprise, since Chebyshev's inequality is often quite loose; indeed, the classical Erdős–Rényi limit theorem gives ${\bf P}(T_N \le N\log_e(N) + cN) \to e^{-e^{-c}}$ as $N \to \infty$, so the upper tail is asymptotically exponentially small in $\lambda$.

[figure: the same probabilities on a log scale]