Cumulative Coupon Collecting Problem : N until $\sum_{\text{unique } x} p(x) > \epsilon$

65 Views Asked by At

Coupon collecting problem is well understood. But in many cases you don't care about collecting the extremely rare "coupons", but about having all the "coupons" you might care about.

E.g. imagine you are learning English and only "understand" a word once you heard it. Then standard coupon collecting problem is essentially: "how many words should I hear until I understand every word in English". As many words are very uncommon in English, a more natural question is "how many words should I listen to, so that I will understand 99% of words (not unique) that I hear". I.e. you want the cumulative distribution of all words you never heard about to be smaller than some $\epsilon$.

Specifically, let $x \in \mathcal{X}$ be a "coupon" and $\mathcal{S}_t \subseteq \mathcal{X}$ be the set of (unique) coupons seen after sampling $t$ i.i.d coupon. I was wondering if the cumulative distribution of seen coupons had already been studied, and if not how to study it:

$$CumCoupon(t) = \sum_{x \in \mathcal{S}_t} p(x)$$

The questions of interests could be:

  1. expected cumulative distribution seen after $t$: $\mathbb{E}[CumCoupon(t)]$
  2. probability of having a cumulative distribution larger than some threshold: $p(CumCoupon(t) > 1-\epsilon)$
  3. Staying in the initial example of understanding the English language, an interesting question is whether there are some approximations of the 2 previous answers for the Zipf distribution (as words are often assumed to be distributed according to zipf).

Note that compared to coupon collecting problem the above variant can also be interesting if there's an infinite (but countable) number of coupons. In the standard problem, this would of course require an infinite $t$ to collect each coupon, but here you could "cover" most of the distribution (i.e. high $CumCoupon(t)$) with a small and finite $t$ (e.g. in the case of power law / zipf distributions).