I have the following problem: I'm reading huge amounts of data records; when done, I would like to display one or more randomly selected records from the data set. It's easy to do if I can cache the data—in that case, it boils down to selecting a few random indexes. It is also easy if I know whether I will get to see a hundred or a million records.
But I don't know how many records I will get to see before the file has been read completely, so it would appear that I'll have to adjust the likelihood of picking a given record as time passes.
Surely, when I'm asked to display three random entries and I'm just reading the first, second or third record, the probability $p$ of displaying them should be 1—after all, in case the data file ends at that point, that's the best I can do.
And indeed, according to xkcd, that's already good enough. On the downside, if I stop right there, I will always see the same sample when I read the same data again. So on reading the fourth record, there should be a probability $\le 1$ that I have to replace one of the records in the display list with the fourth record. But if I keep $p$ at 1 or very close to it, then subsequent records will dominate the results.
Is this possible to do? Maybe there's a practical approximation for the problem?
Suppose you want to find $n$ random samples from the entire file, and you have already read $m\ge n$ data points and selected $n$ samples among those in a way that we will assume is uniform.
Now you see sample number $m+1$. If the file ended here, the sample you just saw should be in the output with probability $\frac{n}{m+1}$. Make a random choice with that probability. If it comes out false, you do nothing. Otherwise, evict a random one among the $n$ saved samples from the first $m$ positions and overwrite it with data item $m+1$.
It should be clear that the $n-1$ samples that don't get evicted form a uniformly chosen set of $n-1$ samples from the first $m$ items. So when you do choose to use data point number $m+1$, you get a uniform distribution among all size-$n$ sets that contain it. And otherwise you already have a uniformly chosen set of $n$ points that doesn't contain item $m+1$. So, all in all, you now have a uniformly chosen set of $n$ items among the first $m+1$ data items.
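As a one-step sanity check on the induction: consider any item $i \le m$ that is currently in the sample. It stays there unless item $m+1$ is accepted (probability $\frac{n}{m+1}$) *and* item $i$ is the one evicted (probability $\frac{1}{n}$), so

$$\Pr[i \text{ survives}] = \frac{n}{m} \cdot \left(1 - \frac{n}{m+1}\cdot\frac{1}{n}\right) = \frac{n}{m} \cdot \frac{m}{m+1} = \frac{n}{m+1},$$

which is exactly the inclusion probability a uniform size-$n$ sample from $m+1$ items must have.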
Proceed by induction until you reach the end of the file.
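This procedure is known as reservoir sampling (Algorithm R). A minimal sketch in Python, with names chosen for illustration:

```python
import random

def reservoir_sample(stream, n):
    """Return a uniform random sample of n items from an iterable
    of unknown length, using O(n) memory."""
    reservoir = []
    for m, item in enumerate(stream):
        if m < n:
            # The first n items fill the reservoir with probability 1.
            reservoir.append(item)
        else:
            # Item number m+1 (1-based) should be kept with
            # probability n/(m+1): draw j uniformly from [0, m] and
            # accept exactly when j < n, which also picks a uniformly
            # random victim to evict.
            j = random.randrange(m + 1)
            if j < n:
                reservoir[j] = item
    return reservoir
```

Note that a single call to `random.randrange(m + 1)` both makes the accept/reject decision and selects which saved sample to overwrite, so only one random draw per record is needed.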