How to pick a random sample of given size from a set of unknown size


I have the following problem: I'm reading huge amounts of data records; when done, I would like to display one or more randomly selected records from the data set. It's easy to do if I can cache the data—in that case, it boils down to selecting a few random indexes. It is also easy if I know whether I will get to see a hundred or a million records.

But I don't know how many records I will get to see before the file has been read completely, so it would appear that I'll have to adjust the likelihood of picking a given record as time passes.

Surely, when I'm asked to display three random entries and I'm just reading the first, second or third record, the probability $p$ of displaying them should be 1; after all, if the data file ends at that point, that's the best I can do.

And indeed, according to xkcd, that's already good enough. On the downside, if I stop there, I will always see the same sample when I read the same data again. So on reading the fourth record, there should be a probability $\le 1$ that I replace one of the records in the display list with the fourth record. But if I keep $p$ at 1 or very close to it, subsequent records will dominate the results.

Is this possible to do? Maybe there's a practical approximation for the problem?


BEST ANSWER

Suppose you want to find $n$ random samples from the entire file, and you have already read $m\ge n$ data points and selected $n$ samples from among them, in a way we assume to be uniform.

Now you see sample number $m+1$. If the file ended here, the sample you just saw should be in the output with probability $\frac{n}{m+1}$. Make a random choice with that probability. If it comes out false, you do nothing. Otherwise, evict a random one among the $n$ saved samples from the first $m$ positions and overwrite it with data item $m+1$.

It should be clear that the $n-1$ samples that don't get evicted form a uniformly chosen set of $n-1$ samples from the first $m$ items. So when you do choose to use data point $m+1$, you get a uniform distribution among all size-$n$ sets that contain it. And otherwise you already have a uniformly chosen set of $n$ points that happens not to contain item $m+1$. So, all in all, you now have a uniformly chosen set of $n$ among the first $m+1$ data items.

Proceed by induction until you reach the end of the file.
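This update rule is straightforward to code. Here is a minimal Python sketch (the function name and variable names are my own):

```python
import random

def reservoir_sample(stream, n):
    """Keep a uniform random sample of n items from an iterable of unknown length."""
    reservoir = []
    for m, item in enumerate(stream):       # m items have been seen before this one
        if m < n:
            reservoir.append(item)           # fill the reservoir with the first n items
        elif random.random() < n / (m + 1):  # keep item number m+1 with probability n/(m+1)
            reservoir[random.randrange(n)] = item  # evict a uniformly chosen slot
    return reservoir
```

For example, `reservoir_sample(open("records.txt"), 3)` would return three lines chosen uniformly from the file, while holding only three lines in memory at any time.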

ANSWER
  1. Assign $s\leftarrow 0$ (the number of items "seen") and prepare an array to hold $n$ records $x_1,\ldots, x_n$
  2. If there is no more input available, terminate: $x_1,\ldots,x_n$ is your sample (provided $s\ge n$, as otherwise no such sample is possible)
  3. Otherwise, read an item $x$ from input and let $s\leftarrow s+1$
  4. If $s\le n$, let $x_s\leftarrow x$
  5. Otherwise, let $p=\frac{n}{s}$. With probability $p$, do the following: Let $i$ be a random number uniformly picked from $1,\ldots,n$ and set $x_i\leftarrow x$
  6. Go back to step 2

This generates up to two random numbers per loop and never stores more items than necessary for a sample. At any point during the process (once the initial $n$ items have been read), the array contains a valid sample of the input so far.
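The numbered steps can be transcribed almost literally into Python (a sketch; the function name is mine, and I return `None` in the $s<n$ case where no sample is possible):

```python
import random

def sample_stream(stream, n):
    """Steps 1-6 above, transcribed literally."""
    s = 0                   # step 1: number of items seen
    x = [None] * n          # step 1: array holding n records
    it = iter(stream)
    while True:
        try:
            item = next(it)                    # step 3: read an item...
        except StopIteration:                  # step 2: no more input, terminate
            return x if s >= n else None
        s += 1                                 # step 3: ...and count it
        if s <= n:
            x[s - 1] = item                    # step 4: still filling the array
        elif random.random() < n / s:          # step 5: with probability p = n/s...
            x[random.randrange(n)] = item      # ...overwrite a uniformly chosen slot
                                               # step 6: loop
```

As a rough check of uniformity, drawing a size-2 sample from 10 items many times over should land each item in the sample close to a fraction $n/N = 0.2$ of the time.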