Simplified problem: I have an array of points in 3D space. I want to find all pairs that are within a given distance from each other. (I'm writing a very simple simulation and the points merge into a new point when they 'hit' each other. So there are actually multiple types of points and multiple hit distances).
Each simulation cycle all particles are displaced by a random 3D vector (simulating random motion). This allows some optimizations to improve the speed of the algorithm across multiple cycles.
Fast solutions that compromise precision (e.g. catch only 80% of the hits, or only catch hits within small voxel boundaries) are also welcome (speed and time complexity are actually my main concerns).
Attempts so far
Now some examples of algorithms I've tried so far (I have no solid knowledge about quad trees or other more advanced mathematical concepts so my solutions are a bit improvised):
Using voxels: Multiply all coordinates by a certain factor $a$ and round them to obtain voxel coordinates. Convert the x, y and z voxel coordinates into a voxel number (using boundaries, e.g. on all axes there are voxels 0 to $n$), so the voxel number is $x\cdot n^2+y\cdot n+z$. Sort the list of voxel numbers and iterate to find particles inside the same voxel. This is probably one of the fastest methods I have used so far, but it obviously compromises precision (due to voxel boundaries; hit distances close to the voxel size won't work very well either). The sorting order of the voxel numbers for each particle can be saved for the next cycle to slightly improve sorting speed.
Ordering by one axis and comparing all close pairs: it's very fast to order all particles by their x coordinate (or any other), especially if you keep the sorting order from the previous cycle. So far I have not found an efficient method to compare the close pairs on one axis to the close pairs on another axis. (I tried to use a Bloom filter to check if a close pair is present on the other axis, but building the filter is way too slow; I'm pretty sure the same goes for a hash set, since the number of close pairs on a single axis is much bigger than the number of particles.) One solution is to just go through all near pairs on one axis and measure their 3D distance. This does not scale very well, but because there is barely any branching, it does allow GPU acceleration (I haven't tried that yet, though).
Z-order curve: Instead of computing voxel coordinates I converted each point to a Z-order curve coordinate and again sorted the array. Then I compared near elements. It turns out this method catches only a fraction of all possible hits (maybe 40%, if I test up to elements that are 3 spaces apart) but this also depends on the implementation a lot (like the precision I use to convert from floating point to integers, and whether to interleave into 32 bits or 64 bits). This method scales really well though.
I'd be excited to learn about other methods to solve this problem!
The cell method uses a regular rectangular grid of cells, with all points within each cell stored in a per-cell array. In C99, the following structures can be used:
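The code block itself appears to have been lost from this copy of the answer; a minimal C99 sketch consistent with the conventions used in the rest of the text (the `CELL_MASK` value and the exact field widths are my assumptions) might be:

```c
#include <stddef.h>
#include <stdint.h>

/* Allocation granularity for per-cell arrays: capacities are always
   (n | CELL_MASK) + 1, i.e. a multiple of CELL_MASK + 1 atoms.
   The value 63 is an assumption; any 2^k - 1 works. */
#define CELL_MASK 63

struct atom {
    double  x;   /* position */
    double  y;
    double  z;
    int64_t t;   /* particle type */
};

struct cell {
    size_t      n;    /* number of atoms currently stored in this cell */
    struct atom a[];  /* flexible array member: room for (n | CELL_MASK) + 1 atoms */
};
```

The later parts of the answer rely on the `.n`, `.a[]`, and `CELL_MASK` conventions above.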
There is no lower bound for the number of cells. It is just that the cell method does not help at all if you only have a single cell.
In order to only need to scan the 26 (3³−1) nearby cells (or just 13 of them, if each pair is only needed once), you want each side of each cell to be at least as long as your maximum interaction distance `rmax`; i.e. `rmax <= (xmax - xmin) / xcells`, `rmax <= (ymax - ymin) / ycells`, and `rmax <= (zmax - zmin) / zcells`. So, there is an upper limit on how many cells you can use.

The optimum number of cells is a very complex question, depending on both the hardware and the simulation/calculations done on the particles. Personally, I'd let the user configure the number of cells as a ratio, and start with something like one cell per 16 particles on average, but not more than `rmax` allows.

If the simulation volume is fixed, then the cells can stay fixed, too. Otherwise, you may need to re-cell the simulation.
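As a sketch of that upper limit (the function name and the clamp-to-one behaviour are my own), the per-axis cell count can be capped so that each cell side stays at least `rmax` long:

```c
#include <stddef.h>

/* Largest usable number of cells along one axis: the floor of
   (max - min) / rmax guarantees each cell side is >= rmax.
   Clamped to at least one cell (a single cell is always valid;
   it just makes the cell method useless). */
static size_t max_cells(double min, double max, double rmax)
{
    size_t n = (size_t)((max - min) / rmax);
    return (n > 0) ? n : 1;
}
```

The ratio heuristic (about one cell per 16 particles) would then pick some count no larger than `max_cells(...)` along each axis.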
Initially, and also when re-celling, you collect all atoms into a single array (I'd use a `struct cell` for the dynamically-allocated array), freeing all existing cells after copying their atoms if re-celling.

If the simulation volume is not fixed, you do a pass over the atoms to find the minimum and maximum coordinates. You will need to increase the maximums slightly, so that they are exclusive, not inclusive. I'd personally subtract `alpha*rmax` from the minimums and add the same to the maximums, to allow a buffer around the simulation volume, with `alpha` being a constant between 0.5 and 3.

Next, you decide the number of cells along each dimension (`xcells`, `ycells`, and `zcells`), and then allocate the array of cell pointers. Remember to check that you do not make the cells too small. You can always reduce the count safely; all it does is reduce efficiency.

Next, you allocate a temporary `xcells*ycells*zcells` array of `size_t`, in order to calculate the number of particles in each cell. You'll do that using a separate pass over the current particle data, using the temporary array as a histogram.

If the particle coordinates are `x`, `y`, and `z`, then the cell coordinates are `ix = (x - xmin) * xcells / (xmax - xmin)`, and similarly for `iy` and `iz`. However, if you have added at least a small buffer zone to the minimum coordinates, you should be able to safely precompute `xscale = xcells / (xmax - xmin)`, `yscale = ycells / (ymax - ymin)`, and `zscale = zcells / (zmax - zmin)`, and then use `ix = (x - xmin) * xscale`, `iy = (y - ymin) * yscale`, and `iz = (z - zmin) * zscale` for each particle. If the temporary array is named `count`, then you'll do the count pass using essentially `count[ix + xcells*iy + xycells*iz]++;` (with `xycells = xcells * ycells`). Note that if you combine the expressions, you must include casts: `count[(size_t)((x - xmin) * xscale) + xcells * (size_t)((y - ymin) * yscale) + xycells * (size_t)((z - zmin) * zscale)]++;`.
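Put together, the counting pass might look like the following sketch (the `struct atom` definition is repeated so the fragment stands alone; the function name is my own, but the variable names follow the text above):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct atom { double x, y, z; int64_t t; };  /* as defined earlier */

/* Histogram pass: count the particles falling into each cell. Assumes the
   buffered bounds guarantee xmin <= x < xmax (likewise for y and z), so
   the truncating casts below cannot yield out-of-range cell coordinates. */
static void count_pass(const struct atom *atoms, size_t natoms, size_t *count,
                       size_t xcells, size_t ycells, size_t zcells,
                       double xmin, double ymin, double zmin,
                       double xscale, double yscale, double zscale)
{
    const size_t xycells = xcells * ycells;

    memset(count, 0, xcells * ycells * zcells * sizeof count[0]);
    for (size_t p = 0; p < natoms; p++) {
        const size_t ix = (size_t)((atoms[p].x - xmin) * xscale);
        const size_t iy = (size_t)((atoms[p].y - ymin) * yscale);
        const size_t iz = (size_t)((atoms[p].z - zmin) * zscale);
        count[ix + xcells*iy + xycells*iz]++;
    }
}
```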
After you have the counts, you can allocate the array for each cell. Keep empty cells as `NULL`. Remember: if a cell has `n` particles, you must allocate room for `(n | CELL_MASK) + 1` `struct atom`s. Initialize the non-`NULL` `struct cell`s with `.n = 0`.

Next, you copy each particle from the list to its proper cell, using the same cell coordinate formula you used for the counting, and using the cell's `.n` as the index into the cell's `.a[]` array, post-incrementing `.n` after adding each particle.

Finally, you can discard the single array containing the original particles.
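A sketch of the per-cell allocation, using the flexible-array-member layout and the `CELL_MASK` rounding described above (`cell_new` is my name for it; the types are repeated so the fragment compiles on its own):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CELL_MASK 63  /* assumed granularity, as before */

struct atom { double x, y, z; int64_t t; };
struct cell { size_t n; struct atom a[]; };

/* Allocate a cell with room for n atoms, rounded up to a multiple of
   CELL_MASK + 1. Returns NULL on allocation failure. */
static struct cell *cell_new(size_t n)
{
    struct cell *c = malloc(sizeof (struct cell)
                            + ((n | CELL_MASK) + 1) * sizeof (struct atom));
    if (c)
        c->n = 0;  /* the copy pass appends atoms via c->a[c->n++] */
    return c;
}
```

The scatter pass is then just `cellptr->a[cellptr->n++] = atoms[p];` for each particle, using the same cell index as in the counting pass.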
If memory use is an issue, you can use a singly-linked list of `struct atom` arrays for the single array (say, a `struct atom_list` containing a `next` pointer and a fixed-size array of atoms), with each array (list node) being at least 132k in size, i.e. somewhere between 4500 and 8192 `struct atom`s per `struct atom_list`. On most Linux architectures the C library then uses `mmap()` to allocate the memory for each list node, and when `free()`d, the memory is immediately returned to the OS.

When a particle moves, you need to check whether it stays within the same cell. The easiest way is to recalculate the cell coordinates for the new particle coordinates, and the cell index (`ix + xcells*iy + xycells*iz`) for both the old and the new coordinates. If the cell index changes, the particle moves to another cell.

In the new cell, you make sure there is enough room. (If `.n == (.n | CELL_MASK)`, the cell is full, and you need to reallocate for `((.n + 1) | CELL_MASK) + 1` `struct atom`s.) You copy the old particle data to the end of the array in the new cell, and increment `.n`.

In the old cell, if the particle was the last one (`.n == 1`), free the cell, and set the pointer to `NULL`. Otherwise, move the last particle in the array on top of the moving particle, and decrement `.n`. You do not necessarily need to reallocate the array to a smaller size (when the decremented `.n` fulfills `(.n | CELL_MASK) == .n`), but if memory use is a concern, it is a good thing to do. (Note that in this case, the new size is `.n + 1` `struct atom`s.)

To find all neighbours of particle `i` in cell `(xi, yi, zi)` (that is, `cell[xi + xcells*yi + xycells*zi].a[i]`), you first check all the other particles in the same cell. Then, if `xi > 0`, you check the particles in cell `(xi-1, yi, zi)`. If `xi < xcells-1`, you check the particles in cell `(xi+1, yi, zi)`. And so on, for all 26 neighbouring cells.

The cell method is often used to construct a neighbour list. A neighbour list is simply a list of particles (a list of particle indexes, with all particles listed consecutively in one huge array) that are close enough.
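The neighbour scan above, including the squared-distance comparison mentioned further down, could be sketched like this (types repeated for self-containment; `neighbour_count` and its clamped loop bounds are my own formulation):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct atom { double x, y, z; int64_t t; };
struct cell { size_t n; struct atom a[]; };

/* Count all atoms within rmax of atom k of cell (xi, yi, zi), scanning
   the cell itself plus its up-to-26 neighbours, comparing squared
   distances to avoid the slow square root. */
static size_t neighbour_count(struct cell *const *cells,
                              size_t xcells, size_t ycells, size_t zcells,
                              size_t xi, size_t yi, size_t zi,
                              size_t k, double rmax)
{
    const size_t xycells = xcells * ycells;
    const struct atom me = cells[xi + xcells*yi + xycells*zi]->a[k];
    const double rr = rmax * rmax;
    size_t hits = 0;

    /* Clamp the 3x3x3 neighbourhood to the grid boundaries. */
    const size_t x0 = (xi > 0) ? xi - 1 : 0, x1 = (xi + 1 < xcells) ? xi + 1 : xi;
    const size_t y0 = (yi > 0) ? yi - 1 : 0, y1 = (yi + 1 < ycells) ? yi + 1 : yi;
    const size_t z0 = (zi > 0) ? zi - 1 : 0, z1 = (zi + 1 < zcells) ? zi + 1 : zi;

    for (size_t cz = z0; cz <= z1; cz++)
        for (size_t cy = y0; cy <= y1; cy++)
            for (size_t cx = x0; cx <= x1; cx++) {
                const struct cell *c = cells[cx + xcells*cy + xycells*cz];
                if (!c)
                    continue;  /* empty cells are NULL */
                const int same = (cx == xi && cy == yi && cz == zi);
                for (size_t j = 0; j < c->n; j++) {
                    if (same && j == k)
                        continue;  /* skip the atom itself */
                    const double dx = c->a[j].x - me.x;
                    const double dy = c->a[j].y - me.y;
                    const double dz = c->a[j].z - me.z;
                    if (dx*dx + dy*dy + dz*dz <= rr)
                        hits++;
                }
            }
    return hits;
}
```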
Sometimes, the neighbour list includes each pair of particles only once. (That is, if particle B is listed as a neighbour of particle A, A is not listed for B.) This is useful for example for classical pair-type potentials, because the forces are symmetric with respect to the two atoms, so there is no need to compute them twice. In this case, looking at the particles in just 13 of the 26 neighbouring cells (a half-shell, e.g. on the positive side) suffices.
If the average step length a particle takes is long, a neighbour list needs to be recalculated too often to be of much use. (It works best when all particles move a little at each time step; then, you can just add a maximum movement limit to `rmax`, so that you only need to recalculate the neighbour list when at least one of the particles has moved more than half the movement limit from the position it had when the neighbour list was last generated. But if each particle jumps a random distance, you'd need to recalculate the neighbour list before you'd have jumped all the atoms, so a neighbour list does not help much in the long-jump case.)

Indeed, I assume a neighbour list approach won't be useful here. This is why I "optimized" (algorithm-wise) the above examples for the cell-method-only use case.
Note that working with squared distances is much, much faster, if you do not need the Euclidean distances for anything. (The square root operation is slow.)
Vectorization for SSE2/AVX/AVX2 etc.
The `struct atom` structure is suboptimal if one wishes to vectorize the code (that calculates the distances between a particle and its neighbours). In fact, the optimum structure is one where `x[]`, `y[]`, `z[]`, and `t[]` are each a separate array, suitably aligned for vector use. (Note that when reallocated (larger or smaller), the arrays typically move within the dynamically allocated memory chunk, too. Unfortunately, `malloc()` may not provide sufficiently-aligned memory; you'll often have to explicitly use `aligned_alloc()` or `posix_memalign()`.)

It is quite possible to implement the optimal structure using an opaque 64-bit type (since I deliberately chose 64-bit types for each field in a `struct atom`) and some macros or inline functions to manage the accesses and reallocation. Personally, I tend to implement the "naïve" structures and functions as above, and the vectorized functions in a separate compilation unit, so that the two are interchangeable, and use torture tests to make sure they both yield the same results (within rounding error, if we consider distances or squared distances).

A custom memory allocator that maintains only the cell arrays is extremely useful. It can keep memory use to a minimum (using a slab approach, or moving cell arrays to plug holes), with minimal runtime overhead.
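As a sketch of that separate-arrays layout (a structure-of-arrays variant of `struct cell`; the 32-byte alignment and the `cell_soa` name are my choices), using C11's `aligned_alloc()`:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CELL_MASK 63  /* same granularity assumption as before */

/* Structure-of-arrays cell: x, y, z, t each in its own aligned array,
   so distance kernels can load contiguous SIMD vectors. */
struct cell_soa {
    size_t   n;    /* atoms in use */
    size_t   cap;  /* capacity; a multiple of CELL_MASK + 1 */
    double  *x, *y, *z;
    int64_t *t;
};

/* Initialize a cell with room for at least n atoms. Because cap is a
   multiple of 64 and each element is 8 bytes, every array size is a
   multiple of the 32-byte alignment, as aligned_alloc() requires. */
static int cell_soa_init(struct cell_soa *c, size_t n)
{
    c->n   = 0;
    c->cap = (n | CELL_MASK) + 1;
    c->x = aligned_alloc(32, c->cap * sizeof (double));
    c->y = aligned_alloc(32, c->cap * sizeof (double));
    c->z = aligned_alloc(32, c->cap * sizeof (double));
    c->t = aligned_alloc(32, c->cap * sizeof (int64_t));
    return (c->x && c->y && c->z && c->t) ? 0 : -1;
}
```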
I like the cell method, because the simple form itself is useful and tunable (via the number of cells in the simulated volume). It can then be extended, trading code complexity for added speed/memory efficiency (assuming you make the appropriate choices). A vectorized implementation using a custom memory allocator for the cell particle data can be extremely efficient, and need not suffer from the data-nonlocality issues (cache misses and latencies) that often slow down classical molecular dynamics simulations. (Those show up as significant drops in speed as the number of particles grows, because the CPU needs more time to access data that is scattered all over memory. Here, the most often accessed data is always consecutive in memory.)