Finding all repeating points in a large set


This is a programming question, but I believe there should be a related mathematical theorem, so sorry if it is not broad enough.

Given a large set of points $(X,Y)$ with $1 \le X \le 2000000$ and $1 \le Y \le 2000000$, how can I find the point or points that repeat more than twice, without recording every point in the set to count occurrences?

All points in the set are randomly distributed, except for the repeating points, which occur multiple times in the set.

Normally it would be easy to spot those points by plotting each of them on a graph. However, since the set is very large and no visualization can be done, I'm looking for a way to approximate the number of occurrences with some sort of reduction algorithm, so any related mathematical problem would be helpful.

Simplified question: Given a very long list of integers $N$ where $1 \le N \le 2000000$, how can one find one or more integers that are probably (with confidence $\ge 90\%$) the most frequent in the list, without counting all occurrences of every number?

Suggested solution: if anyone stumbles onto a similar problem, look up probabilistic data structures (such as the Count-Min Sketch) for information on how to deal with problems like this. The problem can also be solved in a naive way by the answer below.
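As a concrete illustration of the probabilistic-data-structure route, here is a minimal Count-Min Sketch in Python; the `width` and `depth` parameters are illustrative, and the structure only guarantees that estimates never undercount:

```python
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: approximate counts in fixed memory.
    Estimates may overcount (hash collisions) but never undercount."""

    def __init__(self, width=2048, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random hash salt per row.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, salt, item):
        return hash((salt, item)) % self.width

    def add(self, item):
        for salt, row in zip(self.salts, self.table):
            row[self._index(salt, item)] += 1

    def estimate(self, item):
        # The true count is at most the minimum over all rows.
        return min(row[self._index(salt, item)]
                   for salt, row in zip(self.salts, self.table))

# Stream the points once, then query any suspect point.
cms = CountMinSketch()
stream = [(7, 9)] * 5 + [(x, x + 1) for x in range(1000)]
for p in stream:
    cms.add(p)
print(cms.estimate((7, 9)))  # at least 5
```

Memory use is fixed (`width * depth` counters) regardless of how many points are streamed, which is what makes this attractive for a set too large to tabulate exactly.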

Best answer:

I think the best you can do (with two or three passes) is to create a number of "bins" that cover your set of points. For example, we might think of the original universe $[1,\ldots,2000000]\times [1,\ldots,2000000]$ being subdivided into subsets of a convenient form.

Ideally your 1 TB of data would be partitioned into (say) one hundred 10 GB subsets. If the subsets are defined by (say) Cartesian bounds $S_{i,j} = [x_i,x_{i+1})\times [y_j,y_{j+1})$, then any repetitions that occur will necessarily occur within one of these subsets.

Now each subset is constructed to fit within your PC's memory, where an efficient sorting algorithm can identify all the repeated elements.
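The two passes above can be sketched as follows, assuming the data arrives as a stream of `(x, y)` tuples and that each bin's contents fit in memory (the grid size and threshold are illustrative; in the real setting pass 1 would write bins to disk rather than to a dictionary):

```python
from collections import Counter, defaultdict

UNIVERSE = 2_000_000
GRID = 10                    # 10 x 10 = 100 bins (illustrative)
BIN = UNIVERSE // GRID

def bin_of(p):
    # A repeated point always maps to the same bin.
    x, y = p
    return (min((x - 1) // BIN, GRID - 1),
            min((y - 1) // BIN, GRID - 1))

def find_repeats(points, threshold=3):
    # Pass 1: partition the stream into bins.
    bins = defaultdict(list)
    for p in points:
        bins[bin_of(p)].append(p)
    # Pass 2: count each bin independently; each bin fits in memory.
    repeats = {}
    for bucket in bins.values():
        for p, n in Counter(bucket).items():
            if n >= threshold:
                repeats[p] = n
    return repeats

pts = [(5, 5)] * 4 + [(1_999_999, 3), (10, 20), (10, 20)]
print(find_repeats(pts))  # {(5, 5): 4}
```

Sorting each bin (as the answer suggests) and scanning for equal neighbors would work equally well in pass 2; `Counter` is just the idiomatic in-memory equivalent.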

If the points are known to be distributed more or less uniformly throughout the "universe", something like the optimal partition of the universe can easily be precomputed. If the distribution is not uniform and not much information is available a priori about the distribution, then an approach that adaptively defines the partition might be attractive.

Many ways of partitioning the universe will serve our purpose, as the only real requirement is that repeated points will be assigned to the same bin.
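For instance, partitioning by a hash of the point (rather than by Cartesian bounds) also satisfies this requirement, and it needs no knowledge of the spatial distribution, since identical points hash identically. A one-function sketch:

```python
def hash_bin(p, nbins=100):
    # All copies of an identical point land in the same bin,
    # regardless of how the points are spread over the plane.
    return hash(p) % nbins

# Every occurrence of (7, 9) is routed to the same bin.
assert hash_bin((7, 9)) == hash_bin((7, 9))
```

One caveat: Python's built-in `hash` is deterministic for tuples of integers within a process, but a production pipeline spanning multiple runs or machines should use an explicit hash function.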