Coupon collector's problem with unequal probability and calculating the number of expected rare coupons

191 Views Asked by At

This is an instance of the "coupon collector's problem" in which not all the "coupons" have the same probability (more exactly, the probability of coupon $i$ follows Zipf distribution $p_i \sim Zipf(i)$.

There are $N$ distinct coupons. We keep collecting coupons till we have $N'$ distinct coupons. How many distinct rare coupons do we expect to collect? A rare coupon is defined as a coupon with probabilities less than $p_{N'}$. In other words, the top common $N'$ distinct coupons are not rare and the rest are rare.

Can someone point me in the right direction?

1

There are 1 best solutions below

0
On BEST ANSWER

I’ll do this for separate numbers $m$ of distinct coupons collected and $n$ of distinct coupons regarded as common because I don’t see any simplification allowed by the special case $m=n=N'$.

By linearity of expectation, the expected number of distinct rare coupons collected is the sum over $i\gt n$ of the probabilities $c_i$ of coupon $i$ being collected.

This is a generalization of Last coupon collected in the coupon collectors problem, which amounts to the special case $m=N-1$. We can solve the more general problem with the same Poissonization approach.

So let coupons be drawn in a continuous Poisson process with rate $1$ that is the sum of $N$ independent Poisson processes with rates $p_i$.

The probability density for the process for coupon $i$ to yield its first coupon at time $t$ is $p_i\mathrm e^{-p_it}$. At that time, the probability that $m$ distinct coupons have not yet been collected is

$$ \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt}\;, $$

where $\ell=N-m$, $P_i$ is the set of sets of coupons other than $i$ with at least $\ell$ coupons, and $(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}$ is minus the Möbius function on $P_i\cup\{\emptyset\}$, ordered by inclusion, which you can check from

$$ \sum_{T\supset S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}=\sum_{\ell\le k\le|T|}(-1)^{k-\ell}\binom{k-1}{\ell-1}\binom{|T|}{k}=1\;. $$

Thus, the probability $c_i$ that when coupon $i$ is first collected $m$ distinct coupons have not yet been collected, and thus coupon $i$ gets collected, is

\begin{eqnarray} c_i &=& \int_0^\infty\mathrm dtp_i\mathrm e^{-p_it}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt} \\ &=& \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. \end{eqnarray}

For $m=N-1$ and thus $\ell=1$, we recover the probability for coupon $i$ not to be the last coupon collected,

$$ c_i=\sum_{S\subset P_i}(-1)^{|S|-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

For $m=1$ and thus $\ell=N-1$, the sum has a single term, in which $S$ is the set of all $\ell$ coupons other than $i$, and we recover the trivial result $c_i=p_i$.

The expected number of rare coupons collected is

$$ \sum_{i\gt n}c_i=\sum_{i\gt n}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

We can write this as a single sum by collecting the terms with the same denominator:

$$ \sum_{i\gt n}c_i=\sum_T(-1)^{|T|-\ell-1}\binom{|T|-2}{\ell-1}\frac{\sum_{n\lt j\in T}p_j}{\sum_{j\in T}p_j}\;, $$

where the sum now runs over all sets of at least $\ell+1$ coupons.

I don’t see how either of these representations could be simplified for the specific case of a Zipf distribution.