Disclaimer: It's hard to state this question concisely, so don't confuse it with simpler selections questions. OTOH, I'm sure it's been asked, I just can't find it because I can't guess how someone else would phrase it.
Question:
Suppose I have a palette of $s$ distinct colors and have $m$ people independently choose their favorite $d$ colors from that set.
- How many ways are there for them to choose $n$ or fewer distinct colors across their selections?
- Is their a short name (or description) for this kind of selection / problem?
Motivation:
This is interesting because if 100 people chose their top 5 colors from a palette of 34 colors, it would be pretty surprising if they only came up with 6 distinct colors and as a group "neglected" the other 28. I am using this to determine a one-sided p-value for them having $n$ or fewer (i.e. very similar) selections.
Clarifications:
- Each person chooses $d$ distinct colors (so each samples without replacement).
- However, all the colors are available to everyone, so different people can include some or all the same colors in their selections.
My solution so far:
Lower Bound: $${{n \choose d}^m}$$ Upper Bound: $${s \choose n}{{n \choose d}^m}$$
The exact answer is in between these.
The question of getting exactly $n$ different colors points us to generalized Stirling numbers and hence may be computed by inclusion-exclusion the same as ordinary Stirling numbers. Supposing that we have chosen the $n$ colors from the $s$ available ones we classify choices according to the number $q$ of colors that are missing and obtain by PIE
$${s\choose n} \sum_{q=0}^n {n\choose q} (-1)^q {n-q\choose d}^m.$$
This yields for the probability of at most $n$ colors appearing
$$\bbox[5px,border:2px solid #00A000]{ {s\choose d}^{-m} \sum_{p=1}^n {s\choose p} \sum_{q=0}^p {p\choose q} (-1)^q {p-q\choose d}^m.}$$
I verified this formula using the Maple code included below.
with(combinat); ENUM := proc(s, d, m, n) option remember; local src, res, part, psize, mset, admit, recurse; src := choose({seq(q, q=1..s)}, d); admit := [seq(0, q=1..m)]; recurse := proc(sofar, srcpos, count, incl) local seen; if incl then admit[count] := admit[count] + 1; fi; if srcpos > nops(src) or count = m then return; fi; recurse(sofar, srcpos+1, count, false); seen := sofar union src[srcpos]; if nops(seen) <= n then recurse(seen, srcpos+1, count+1, true); fi; end; recurse({}, 1, 0, false); res := 0; part := firstpart(m); while type(part, `list`) do psize := nops(part); mset := convert(part, `multiset`); res := res + admit[psize] * m!/mul(p!, p in part) * psize!/mul(p[2]!, p in mset); part := nextpart(part); od; res*binomial(s,d)^(-m); end; X := (s, d, m, n) -> binomial(s,d)^(-m) * add(binomial(s,p) * add(binomial(p,q)*(-1)^q*binomial(p-q,d)^m, q=0..p), p=1..n);Addendum. We should get probability one when we have $n=s.$ To prove this we write
$${s\choose d}^{-m} \sum_{p=0}^s {s\choose p} \sum_{q=0}^p {p\choose q} (-1)^{p-q} {q\choose d}^m \\ = {s\choose d}^{-m} \sum_{q=0}^s {q\choose d}^m (-1)^q \sum_{p=q}^s {s\choose p} {p\choose q} (-1)^{p}.$$
Observe that
$${s\choose p} {p\choose q} = \frac{s!}{(s-p)! \times q! \times (p-q)!} = {s\choose q} {s-q\choose s-p}$$
so this is
$${s\choose d}^{-m} \sum_{q=0}^s {q\choose d}^m (-1)^q {s\choose q} \sum_{p=q}^s {s-q\choose s-p} (-1)^{p} \\ = {s\choose d}^{-m} \sum_{q=0}^s {q\choose d}^m (-1)^q {s\choose q} \sum_{p=0}^{s-q} {s-q\choose s-q-p} (-1)^{p+q} \\ = {s\choose d}^{-m} \sum_{q=0}^s {q\choose d}^m {s\choose q} \sum_{p=0}^{s-q} {s-q\choose p} (-1)^p.$$
Evaluating the inner sum we now find
$${s\choose d}^{-m} \sum_{q=0}^s {q\choose d}^m {s\choose q} \times [[q=s]] = {s\choose d}^{-m} {s\choose d}^m {s\choose s} = 1$$
and the sanity check goes through.
Remark. The details of the Stirling number argument are in the comments to MSE 2791477.