Dealing with large lists of numbers


I am looking for references that discuss computer programs handling a collection of numbers too large to fit in machine memory.

Some context:
I am working with the GNU Multiple Precision library, a.k.a. gmplib (double precision was insufficient), and generating a list of numbers. Many of the numbers generated are duplicates (there is some room for improvement, but there will still be lots of duplicates). One of the smaller lists has ~300,000,000 unique points, and I would like to work with more. Afterwards, I need to search through the list to make sure there are no duplicates within the defined precision, and then do some further calculation.

Thoughts on my problem:
Naively, I could just write the points to a file and store them there. But then I need to search through the list, so perhaps I should keep it sorted; and at that point, maybe it is easier to just use a database.
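As a sketch of the sorted-list idea: once the list is sorted, any two values within the target precision of each other end up adjacent, so one linear scan finds every near-duplicate. A minimal in-memory sketch, using `double` in place of GMP's `mpf_t` for brevity (the function names are mine; with GMP you would compare via `mpf_sub`/`mpf_cmp`, and for lists that exceed RAM the `qsort` step would become an external merge sort over sorted disk runs):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Comparison function for qsort on doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort, then scan adjacent pairs: after sorting, near-duplicates
   (values within eps of each other) are always neighbors, so one
   linear pass counts them all. */
size_t count_near_duplicates(double *v, size_t n, double eps) {
    qsort(v, n, sizeof *v, cmp_double);
    size_t dups = 0;
    for (size_t i = 1; i < n; i++)
        if (fabs(v[i] - v[i - 1]) <= eps)
            dups++;
    return dups;
}
```

The same sort-then-scan structure carries over unchanged to the disk-based setting: sort each chunk that fits in RAM, merge the sorted runs, and check adjacent values during the merge.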

So I'm looking into how other people have handled this. Maybe Mathematica handles it easily and transparently (in my in-memory version, the kernel died after using all machine memory); maybe a C program interfacing with MySQL is the best solution (this is where I am now, and it is slow, so I am spending time on DB optimization); maybe "get a machine with more RAM" is the best solution; and so on (what counts as "best" depends, of course, on the problem).

References are preferred (I think this question is fairly general), or even just comments pointing me to the right keywords to search for, but I welcome any feedback on my specific problem. Thanks.

Answer 1

You have too many numbers to store in RAM, so you have to write them (in native binary form) to disk files first. You don't want any one file to be too big, because searching it takes more time. The ideal is to determine intervals that each contain at most some maximum count of your numbers.

As a first pass, write all the numbers to one big master file, then scan that file to get an idea of the distribution of the numbers. From the distribution you can choose intervals that don't have too many numbers in them. On a second pass, scan the master file and write each number out to the secondary file for its interval, with one secondary file per interval. From there, you can read each secondary file into memory, one at a time, and process it there before going on to the next file.
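A minimal sketch of the second pass of the interval scheme described above, again using `double` as a stand-in for GMP values. The `bucket_index` helper, the equal-width intervals, and the `bucket_%zu.bin` file names are my own illustrative choices, not from the answer; the answer's distribution scan would instead pick interval boundaries so that no bucket receives too many numbers:

```c
#include <stdio.h>

/* Map a value to its interval index in [lo, hi], assuming the first
   pass has already found lo and hi. Equal-width intervals are the
   simplest choice. */
size_t bucket_index(double x, double lo, double hi, size_t nbuckets) {
    size_t b = (size_t)((x - lo) / ((hi - lo) / (double)nbuckets));
    return b < nbuckets ? b : nbuckets - 1;   /* clamp x == hi */
}

/* Second pass: append each number, in native binary form, to the
   secondary file for its interval. */
int partition_into_buckets(const double *v, size_t n,
                           double lo, double hi, size_t nbuckets) {
    for (size_t i = 0; i < n; i++) {
        char name[64];
        snprintf(name, sizeof name, "bucket_%zu.bin",
                 bucket_index(v[i], lo, hi, nbuckets));
        FILE *f = fopen(name, "ab");
        if (!f) return -1;
        fwrite(&v[i], sizeof v[i], 1, f);
        fclose(f);
    }
    return 0;
}
```

Opening and closing a file per number is slow; in practice you would hold an array of `FILE *` handles, or buffer values per bucket and flush in batches.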

Answer 2

The problem is that there is too much work for one machine. I thought I could offload data storage, trading latency for space. This is basically a variant of "get a better machine" (vertical scaling). However, there seems to be a practical limit to vertical scaling, and it is somewhat all-or-nothing: using disk-based storage as a stand-in for RAM isn't really practical, and isn't really done explicitly (to my limited knowledge; of course, the host operating system commonly does this transparently via swapping).

It seems the logical next step in optimizing such a problem is to break it into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.

I will mention a few bullet points and starter references, but at this point I think there is no "real world", on-topic (for math.SE) answer of the kind I had originally hoped for (offloading the datastore for a single monolithic application). That is, there are a number of papers, and even journals, covering distributed and parallel computing, but nothing specifically on solving problems in the manner I ask about above. With that, I'm not sure whether this question should be closed as off topic.


Probably the most applicable aspects of a distributed computing project are checkpointing and load balancing.

Checkpointing

In addition to their primary roles in scientific computation, cores must also perform checkpointing by saving the exact state of the calculation to disk periodically, because a volunteer's PC may crash or be turned off at any time. Since work units can run for days, checkpointing reduces the amount of work that is lost to several minutes when the client resumes operation. Most applications do not normally include checkpointing, or do so without the complete state needed, so adding this capability is the most significant part of making code that is designed to run on a single system run in a distributed system. [1]

There is more discussion of fault tolerance and thread migration in [3].
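A minimal sketch of the checkpointing idea, assuming (hypothetically) that the computation's state can be captured by a loop counter and one accumulated value; a real checkpoint would also serialize the GMP values and any RNG state. The write-to-temp-then-rename pattern keeps a crash from ever leaving a half-written checkpoint behind:

```c
#include <stdio.h>

/* Write the loop counter and running state to a temp file, then
   rename into place; on POSIX filesystems rename is atomic, so the
   checkpoint file is always either the old or the new version. */
int save_checkpoint(const char *path, long iter, double state) {
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    fwrite(&iter, sizeof iter, 1, f);
    fwrite(&state, sizeof state, 1, f);
    fclose(f);
    return rename(tmp, path);
}

/* Restore the counter and state; a missing file means "start fresh". */
int load_checkpoint(const char *path, long *iter, double *state) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t ok = fread(iter, sizeof *iter, 1, f)
              + fread(state, sizeof *state, 1, f);
    fclose(f);
    return ok == 2 ? 0 : -1;
}
```

On resume, the main loop calls `load_checkpoint` and, if it succeeds, continues from the saved iteration instead of iteration zero.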

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

[1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. https://doi.org/10.1109/IPDPS.2009.5160922

[2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6

[3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5