I am looking for references that discuss how computer programs handle a collection of numbers too large to fit in machine memory.
Some context:
I am working with the GNU Multiple Precision library, a.k.a. gmplib (double precision was insufficient), and generating a list of numbers. Many of the numbers being generated are duplicates (there is some room for improvement, but lots of duplicates remain). One of the smaller lists has ~300,000,000 unique points, and I would like to work with more. Afterwards, I need to search through the list to make sure there are no duplicates within the defined precision, and then do some further calculation.
Thoughts on my problem:
Naively, I could just write the points to a file and store them there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point maybe it's just easier to use a database.
So I'm looking into how other people have handled this. Maybe Mathematica handles it easily and transparently (with my in-memory version, the kernel died after using all of the machine's memory); maybe a C program interfacing to MySQL is the best solution (this is where I am now, and it is slow, so I am spending time on DB optimization); maybe "get a machine with more RAM" is the best answer; etc. (Of course, "best" is arbitrary and depends on the problem.)
References are preferred (I think this question is somewhat general), or even just comments pointing me to the right keywords to search for, but I welcome any feedback on my specific problem. Thanks.
You have too many numbers to store in RAM, so you have to output them (in native binary form) to disk files first. You don't want any one file to be too big, because searching it takes more time. The ideal would be to determine intervals that each contain at most some maximum count of your numbers.

As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. From this you can determine intervals without too many numbers in them.

On a second pass, scan the master file and write each number to the secondary file covering its interval; the number of intervals is the number of secondary files. From there, one by one, you can read each secondary file into memory and process it there before going on to the next file.