How to correctly sample lines from a large file when the order and line count are unknown?


Input:

  • a potentially large file of text lines
  • the line count is unknown; a naive estimate puts it somewhere above 100K lines but below 400K
  • it's unknown whether the lines are sorted
  • it's fine to keep at most 10,000 lines in memory

Limitations:

  • read lines one by one; it is not possible to read the whole file into memory at once.
  • read the file once and sample lines during that single pass; it takes far too long to read all lines first just to get the line count.

Goal:

Sample the minimum number of lines from the file while capturing maximum variance.

Example: Here is the file:

1,Tom,22,3000000
3,Bob,34,4000000
...
10,Tim,18,500000.05

The lines are ordered; I expect to sample

1,Tom,22,3000000
10,Tim,18,500000.05

Solution: Does this make sense?

  • read line by line
  • keep a tail buffer for the most recent 1000 lines
  • keep a sample buffer for 9000 lines
  • add every line with index XXX (what is the right formula here?) to the sample buffer
  • union the sample buffer and the tail buffer once the end of file is reached.


BEST ANSWER

This sounds like the classic reservoir sampling use case. Reservoir sampling is a well-studied family of algorithms addressing exactly this problem. With reservoir sampling, only a buffer the size of the resulting sample is kept in memory, never the entire input file.
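As a minimal sketch, the simplest variant (Algorithm R) looks like this in Python (the function name and the `data.txt` filename are illustrative, not from the original post):

```python
import random

def reservoir_sample(lines, k):
    """Algorithm R: uniformly sample k items from a stream of unknown length.

    Memory use is O(k) regardless of how many lines the stream contains.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)  # fill the reservoir with the first k lines
        else:
            # line i survives with probability k / (i + 1); pick a random
            # slot in [0, i] and replace only if it falls inside the reservoir
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    return reservoir

# usage: a file object is already a line iterator, so one pass suffices
# with open("data.txt") as f:
#     sample = reservoir_sample(f, 10_000)
```

Because the file object is consumed lazily, this reads the file exactly once and never holds more than `k` lines, which fits the 10,000-line memory budget above.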

The only difference from the classic algorithm is that the problem statement appears to require preserving the input order. This can be done by saving the line number with each record and sorting by it before writing out the sample.
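A hedged sketch of that order-preserving variant: store `(line_number, line)` pairs in the reservoir and sort at the end (again illustrative code, not from the original answer):

```python
import random

def reservoir_sample_inorder(lines, k):
    """Reservoir sampling that emits the sample in original input order.

    Keeps (line_number, line) pairs; the final sort touches only k items,
    so it is cheap compared with the single pass over the file.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append((i, line))
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = (i, line)
    reservoir.sort()  # restore input order by line number
    return [line for _, line in reservoir]
```

Since the sort is over at most `k` items, the extra cost is O(k log k), independent of the file size.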

GNU shuf is a popular and widely available implementation, though it does not support retaining the original line order. tsv-sample is an open-source program similar to GNU shuf that does support preserving input order (disclaimer: I'm the author).
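For reference, the command-line usage might look like the following (the `input.txt`/`sample.txt` names are placeholders; check the tsv-sample flag against the version you have installed):

```shell
# Draw 10,000 random lines with GNU shuf (output order is shuffled):
shuf -n 10000 input.txt > sample.txt

# tsv-sample, preserving the original line order
# (--inorder per the tsv-utils docs; verify on your version):
tsv-sample -n 10000 --inorder input.txt > sample.txt
```

Both tools read the input in a single pass with memory proportional to the sample size, matching the constraints in the question.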