Number of resamples with repetition where the number of unique elements are fixed

47 Views Asked by At

This comes from Bootstrap method, but I am not sure which would be the right keyword to describe the quantity I want to calculate.

There is a set $\mathbb{A}$ with $N$ elements, where all elements are distinct. For instance, $\mathbb{A}=\{a, b, c\}$ for $N=3$.

Suppose we draw $N$ elements from $\mathbb{A}$, but allowing the repetition. The number of possible cases, $P(N)$, would be $$P(N)=\left(\left(\matrix{N\\N}\right)\right)=\frac{\left(2N-1\right)!}{N!\left(N-1\right)!}$$ Let's call this new set we draw as a resample. For instance, $\{a,a,b\}$ is a resample, $\{a,b,c\}$ is also one, and $\{c,c,c\}$ is also a resample.

Now, what I want to do is, make a random resample, then count the number of unique elements in the resample. For instance, $\{a,a,b\}$ has 2, $\{a,b,c\}$ has 3, and $\{c,c,c\}$ has only one.

Let $P(N, k)$ be the number of resamples with $k$ unique elements. By definition, it should be obvious that $$ P(N) = \sum_{k=0}^{k=N} P(N,k) $$

The question is, what is $P(N, k)$?

I approached this in the following way. When we have $k$ unique elements, we first have to choose $k$ element without repetition from $N$ elements. Then, we have $N-k$ elements that are repeated, which we could just put each in one of the $k$ possible choices. So, I thought $$ P(N,k) \stackrel{?}{=} {}_{N}C_k\cdot k^{N-k} = \frac{N!}{k!(N-k)!}\cdot k^{N-k} $$ For instance, $P(3,0)=0$, $P(3,1)=3$, $P(3,2)=6$, and $P(3,3)=1$. Summing them all gives $P(3)=10= \sum_{k=0}^{k=3} P(3,k)$, which is good.

But, it does not satisfy $P(N) = \sum_{k=0}^{k=N} P(N,k)$ for $N>3$.

Where have I got it wrong? And what is the correct form for $P(N,k)$?

1

There are 1 best solutions below

1
On BEST ANSWER

It's less confusing to answer the more general question but for drawing $m$ elements from the set. Since you use stars-and-bars it sounds like you want to consider the result as a multiset. (This makes it a little delicate what counts as a "random resample." For example, if you randomly resample by choosing each of the new elements uniformly at random, one at a time, the resulting probability distribution over multisets is not uniform.)

So we want to count the number of multisets of $m$ elements from a set $A$ of $n$ elements in which $k$ distinct elements of the set occur. Choosing such a multiset amounts to choosing a $k$-element subset of $A$ (which e.g. we sort according to some total order on $A$), then choosing multiplicities $a_1, \dots a_k \ge 1$ satisfying $\sum a_i = m$ (one for each of the elements in our $k$-element subset). This can be done in ${n \choose k} {m-1 \choose k-1}$ ways, which sums up to the total number of multisets of $m$ elements from a set of $n$ elements

$$\sum_k {n \choose k} {m-1 \choose k-1} = {n+m-1 \choose m-1}$$

by Vandermonde's identity (writing ${m-1 \choose k-1} = {m-1 \choose m-k}$). Taking $m = n$ gives $\boxed{ P(n, k) = {n \choose k} {n-1 \choose k-1} }$.

Your argument is incorrect because it doesn't take into account that it doesn't matter what order you assign the repeated elements in; in a multiset only the multiplicities matter. Of course if you aren't talking about multisets and you are really talking about a sequence of elements of $A$, where order matters, then that would be a different counting problem. So you need to get clear which problem you're interested in.