How can I estimate the number of repeated selections

75 Views Asked by At

I have a very challenging problem that I cannot find a way to solve without python simulation.

Given a dataset of size X (very large number), we want to select H entries from X without replacement and the order doesn't matter. Then, repeat this process N times.

Selecting H from X follows a uniform distribution.

How can I estimate the total numbers of entries that would be selected multiple times?

I know that for a very large X, selection without replacement is not going to be that different from selection with replacement, so I tried modeling it using the binomial theorem, but I cannot wrap my head around how to start the calculation.

1

There are 1 best solutions below

0
On BEST ANSWER

For each entry, the probability it is selected in a particular round is $H/X$. Therefore, over the course of $N$ rounds, the number of times that entry is selected follows a binomial distribution with parameters $N$ and $H/X$. In particular, the probability the entry is selected twice or more is $$ \text{probability of being selected twice or more } = 1-(1-p)^N-\binom N1p(1-p)^{N-1}\tag{$*$} $$ where $p=H/X$. Finally, using linearity of expectation, the expected number of entries selected multiple times is the number of entries times the probability for each entry. $$ \text{expected # entries selected twice or more }=X\left[1-(1-p)^N-\binom N1p(1-p)^{N-1}\right] $$ Furthermore, let $q$ be the probability in $(*)$. The expected number of entries selected twice or more is $qX$, but what the variance? We cannot use a binomial distribution, since the events of being selected twice or more are dependent. However, since these events are negatively correlated, the variance is bounded above by the variance of a binomial. That is, $$ \text{variance of # entries selected twice or more }\le q(1-q)X. $$ So now you know the mean, and have an upper bound for the spread.