How do I calculate the expected number of duplicates in a set based on the number of duplicates found only in a sample?


I have a set of 10,000 points of data. I know that some of those points of data are identical to other points of data (i.e. duplicates), but (for now) there are no triplicates, quadruplets, etc. This means that the total number of unique points of data is less than 10,000. I took a random sample of 1,000 points of data from that set and did brute force comparisons within that sample to discover that there are 40 duplicates in the sample, i.e. unique pairs of data. How many duplicates would I expect to find in the full set of data? Please show and explain how to calculate the answer.

For bonus points, how do we generalize this to allow for triplicates, quadruplets, etc.?


A few estimators for this type of problem were proposed by Leo Goodman in the paper On the Estimation of the Number of Classes in a Population, and there is also some more relevant information in Sampling to estimate the number of duplicates in a database by Bunge and Handley, although they propose a sampling scheme in which you actively search the full population for duplicates of the records you have sampled.

The estimator Goodman proposed as "most suitable" (and which appears to be printed incorrectly in the paper's abstract) looks something like this:

$$S' = N - \frac{N(N-1)}{n(n-1)} x_2, \qquad T' = \begin{cases} S' & \textrm{if } S' \geq \sum x_i \\ \sum x_i & \textrm{otherwise} \end{cases}$$

where $x_i$ is the count of values in the sample that appear $i$ times each, so $x_1$ is the number of singletons, $x_2$ is the number of duplicate values, etc., and $n$ and $N$ are the sample and population sizes respectively.

This is an estimator for the total number of distinct values in the population (i.e. counting duplicated values once each), so we need an adjustment if we want to estimate the number of duplicates instead. Thankfully, if we know (or assume) that no value appears three or more times, we can say that $N = X_1 + 2 X_2$ (the population is made up of all the singletons plus two copies of each duplicated value), whereas the number of distinct values is $T = X_1 + X_2$ (the singletons plus the duplicated values counted once). Subtracting gives $X_2 = N - T$, and hence our estimator becomes

$$X_2' = \begin{cases} \frac{N(N-1)}{n(n-1)} x_2 & \textrm{if } S' \geq x_1 + x_2 \\ N - (x_1 + x_2) & \textrm{otherwise}\end{cases}$$

So in your case, we have $N = 10000$, $n = 1000$, $x_2 = 40$, and $x_1 = 1000 - 2 \times 40 = 920$. This gives $S' \approx 5996.4$, and since $S' \geq x_1 + x_2 = 960$, we get $X_2' = 10000 - 5996.4 \approx 4003.6$.
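The calculation above can be sketched in a few lines of Python. This is my own sketch, not code from either paper, and the function name is made up; it just implements the adjusted estimator $X_2'$ under the no-triplicates assumption.

```python
def goodman_duplicates(N, n, x1, x2):
    """Estimate the number of duplicated values X_2 in a population of
    size N, given a sample of size n containing x1 singletons and x2
    values that each appear twice (assuming no value appears 3+ times)."""
    S = N - N * (N - 1) / (n * (n - 1)) * x2   # Goodman's S'
    if S >= x1 + x2:
        return N - S                            # = N(N-1)/(n(n-1)) * x2
    return N - (x1 + x2)                        # fallback branch

print(goodman_duplicates(10000, 1000, 920, 40))  # -> about 4003.6
```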

It's not clear to me what the variance of $S'$ is, though, and the Bunge and Handley paper suggests that it may not perform particularly well, so I would take that estimate with a grain of salt. Presumably the issue is that you need to use the number of duplicates in the sample to estimate how many sample records might have another copy somewhere in the population, which means extrapolating from some very small proportions to the whole population.