The proportion between distinct labels in a multiset and the total amount of labels

68 Views Asked by At

Say we have a (multi)set $\alpha$ of $n$ balls, each of them is labeled with a number in $\{1,\ldots,m\}$ (where $m<n$ ). Denote by $d$ the amount of distinct labels in $\alpha$. Is it true that there exists a constant $c$ such that for every $\alpha$ as such, if we uniformly select a subset of $\alpha$ of size $\sqrt{m}$, then with high(say $2/3$) probability (over the selected subset) the number of distinct labels in the subset we picked is not greater than $c\sqrt{d}$ (for every $m$) ?

1

There are 1 best solutions below

1
On BEST ANSWER

I wonder whether the question now says what you intended it to say; it seems strange to have two different variables for the number $d$ of actual labels and the number $m$ of potential labels that aren't actually used.

The answer is no. You can always choose $m=n-1$. For any given $d$, you can make $n$, and thus $m$, and thus $\sqrt m$, arbitrarily large by having an arbitrarily large number $k$ of instances of each label, resulting in $n=kd$. Thus, you can pick an arbitrarily large number $\sqrt m$ of balls out of a population where every one of a fixed number of $d$ distinct labels has the same probability, and the probability of picking anything other than a full set of $d$ distinct labels goes to zero as $k$ goes to infinity. Since $d$ cannot be bounded by $c\sqrt d$ with constant $c$, there is no such constant. (The fact that you're drawing without replacement only works in favour of the argument, since it increases the probability of drawing distinct labels.)