Sorting large vector by distance

131 Views Asked by At

Lots of dating apps sort they results by distance. That's it, you're at position $(x, y)$ at the Earth map and the application shows you up to, let's say, $k$ potential partner. Those are not any people, but the $k$-nearest to you.

If I've got a vector $v$ with $n$ pairs $(x_i, y_i)$ indicating people's position, calculating the distance from my position $(x, y)$ to each position $(x_i, y_i)$ would be $\Theta(n)$, and selecting the $k$ smallest would probably be also $\Theta(n)$ (in mean in running time, probably using min heaps or something alike).

The problem is that $n$ may be huge (imagine how many users do use Tinder, Grindr and applications like that).

So, my question is, how do they do it so fast?

2

There are 2 best solutions below

0
On BEST ANSWER

As was hinted in the comment, clustering is the key. One possible scenario could be the following:

When you start your app, the app puts you into a bucket (or bin, or category, or whatever you want to call it). The bucket gives a hint to your location, e.g. your country, city, etc. This happens to all users, so everyone is already pre-sorted into a bucket (clustering) and this must not happen during the actual search. Now when there are $N$ buckets with on average $\bar n$ inhabitants, then you can first look for the $K<<N$ nearest buckets to your bucket, e.g. your own city and the nearest ones, and then you only look at those $\bar n K <<\bar nN=n$ nodes in your distance graph. If this number is still too large (e.g. a very big city) you might consider choosing smaller buckets or you cluster the buckets too, e.g. city districts.

Finding the $K$ nearest buckets might take $\Theta(N)$ time, but since the buckets represent cities or other immutable entities, you will calculate this only once. So it is not part of the interactive search. Now your search takes $\Theta(\bar nK)$ steps which is hopefully much smaller than $\Theta (n)= \Theta(\bar n N)$.

2
On

One data structure very applicable to this kind of problem is the k-d tree. (I think the wiki article explains it well enough, but comment if you have questions). If you store your positional data in a k-d tree, a nearest neighbor search becomes $O(\log(n))$.

I'm not sure if this is precisely what dating apps use (they probably just cluster by city where $n_{\text{city}}$ is "small enough" to brute search), but k-d trees are common in the field of motion-planning.