Can random sampling affect view of how much repetition there is in a data set?

43 Views Asked by At

$X$ unique individuals take $Y$ actions.

If I sample $1\%$ of the actions entirely at random what is the relationship between the number of unique individuals seen in my sample and $X?$

(In my actual use case $Y$ is $2\ 000\ 000$ and $X$ is unknown.)

1

There are 1 best solutions below

0
On

Here is a sketch-approximation: Suppose there are $X$ individuals and each takes $N$ actions. Since your sample of $Y$ is relatively small, we can ignore small differences between sampling with replacement and sampling without replacement. So we find ourselves in a situation where $SS$ samples from $X$ equally likely categories result in $U$ unique results. Expected time to get $U$ unique results is

$$X(\frac 1 X+\frac 1 {X-1}+\dots+\frac 1 {X-U+1})\approx X\log\frac{X+1}{X-U+1}\approx SS$$

Solve for $X$. Alternatively, consider you random sampling as drawing from $X$ equally likely categories. The expected number of unique draws $f$ is governed by the difference equation: $$\Delta f=1-\frac f X$$ solution to which is $$f(n)=X(1-(1-\frac 1 X)^n)$$ hence we get an equation: $$X(1-(1-\frac 1 X)^{SS})=U$$ It's easy to check that these two equations are approximately equivalent for large $X$. Rewrite the last equation (for large $X\ge U$) as $$e^{-k}=1-\alpha k$$ for ${k=\frac {SS}X}, \alpha=\frac U{SS}$.

If $\alpha\approx 1$, Taylor expansion yields $k\approx 2(1-\alpha)$ and $X\approx\frac {SS}{2(1-\alpha)}$

If $\alpha\approx 0$, $k\approx\frac 1 {\alpha}$ and $X\approx U$.