Number of unique elements in a random sample of a dataset with repeated values


Suppose you have a dataset df containing N observations of the integers 1 to k = N/j for some integer j, with equal frequency (i.e. each value occurs j times). Suppose you take a random sample (dfsample) of n observations, without replacement. Let X be the number of distinct values in dfsample. What is E[X]? What is the distribution of X? More generally, how does this change as you vary the distribution of values in the original dataset df? Is E[X] a function of entropy or some other representative measure of variance, or is it more complicated?

You may assume N >> k and N >> n, but n<>k.


There is 1 best solution below


The expectation is not difficult when you start with $j$ copies of each of $k$ integers, so that $N=jk$. By linearity of expectation, $\mathbb E[X]$ is the sum over the integers of the probability that each one appears in the sample, and it is easiest to look at the complementary probability of not finding a particular integer, which is $\binom{N-j}{n}\big/\binom{N}{n}$. You get: $$\mathbb E[X] = k\left(1-\dfrac{\binom{N-j}{n}}{\binom{N}{n}}\right)$$ and this can easily be extended to starting with $j_i$ copies of each integer $i$ from $1$ to $k$, with $N=\sum\limits_{i=1}^k j_i$: $$\mathbb E[X] = \sum\limits_{i=1}^k \left(1-\dfrac{\binom{N-j_i}{n}}{\binom{N}{n}}\right)$$ In terms of your question ("a function of entropy or some other representative measure of variance, or is it more complicated"), this might be seen as more complicated, but not excessively difficult.
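As a rough illustration of the general formula, here is a minimal Python sketch. The function name `expected_distinct` and the list `counts` (holding the $j_i$) are made up for this example and do not come from the question; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb


def expected_distinct(counts, n):
    """Expected number of distinct values in a sample of size n drawn
    without replacement, where counts[i] is the number of copies of value i.

    `counts` and the function name are illustrative assumptions.
    """
    N = sum(counts)
    if not 0 <= n <= N:
        raise ValueError("sample size n must satisfy 0 <= n <= N")
    total = comb(N, n)
    # For each value, add the probability that at least one copy is sampled:
    # 1 - C(N - j_i, n) / C(N, n).
    return sum(1 - Fraction(comb(N - c, n), total) for c in counts)


# Equal frequencies: k = 2 values, j = 2 copies each, sample of n = 2.
print(expected_distinct([2, 2], 2))  # 5/3
```

With equal counts this reduces to the first expression, and unequal counts such as `[3, 1]` can be passed in the same way.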

For given $N$ and $k$, I would intuitively think that the first expression is an upper bound on the second, i.e. that equal frequencies maximise $\mathbb E[X]$.
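One way to support this intuition, sketched here under the assumptions that $n \ge 2$ and that $N/k$ is an integer, is to check that $f(j)=\binom{N-j}{n}$ is convex in $j$. By Pascal's rule, $f(j-1)-f(j)=\binom{N-j}{n-1}$, so the second difference is $$f(j+1)-2f(j)+f(j-1) = \binom{N-j}{n-1}-\binom{N-j-1}{n-1} = \binom{N-j-1}{n-2} \ge 0.$$ Jensen's inequality then gives $$\sum\limits_{i=1}^k \binom{N-j_i}{n} \ge k\binom{N-N/k}{n},$$ and dividing by $\binom{N}{n}$ and subtracting from $k$ shows that the unequal-frequency expectation is at most the equal-frequency one.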

The actual distribution is unlikely to be simple. You could approach it by summing terms from the multivariate hypergeometric distribution, possibly using inclusion-exclusion.
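For the equal-frequency case, the inclusion-exclusion idea can be sketched as follows. This is one possible approach, not necessarily the simplest: choose which $x$ values appear, then count the samples drawn from their $xj$ copies that hit all $x$ of them. The function name `distinct_count_pmf` is an assumption made for this example.

```python
from fractions import Fraction
from math import comb


def distinct_count_pmf(k, j, n):
    """Exact pmf of X, the number of distinct values in a sample of size n
    drawn without replacement from k values with j copies each (N = k * j).

    Illustrative sketch only; the name and signature are assumptions.
    """
    N = k * j
    if not 0 <= n <= N:
        raise ValueError("sample size n must satisfy 0 <= n <= N")
    total = comb(N, n)
    pmf = {}
    for x in range(k + 1):
        # Inclusion-exclusion: samples of size n from x given values that
        # include at least one copy of every one of those x values.
        hits_all = sum(
            (-1) ** i * comb(x, i) * comb((x - i) * j, n)
            for i in range(x + 1)
        )
        # comb(k, x) ways to choose which x values are the ones present.
        pmf[x] = Fraction(comb(k, x) * hits_all, total)
    return pmf


pmf = distinct_count_pmf(2, 2, 2)
print(pmf[1], pmf[2])  # 1/3 2/3
```

The mean of this pmf agrees with the closed-form expectation, which is a useful sanity check. For unequal frequencies the same counting needs a sum over subsets of values, or a different method altogether.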