Finding center of larger cloud of points?


First of all, excuse the abstract wording of the title. I don't know the proper term for the concept I want to find, which is part of my problem I guess.

I've got a cloud of points (in my case it's 3d points, but it shouldn't matter). They are placed in such a way that most of them are in a general area, but with groups of outliers. Take this picture as an example:

[Image: a large, loose cloud of points on the left, with a few small, dense clusters of outliers in the lower right.]

I would like to find a value that gives me the center point of the big cloud of points, so to speak. That is, an equivalent of the average that isn't affected by the outliers. But there is an important detail: the bigger cloud should be defined as the cloud that is larger in spatial extent, not the cloud with the most points. In the sample image, for example, one of the clouds in the lower right might well contain more points than the large cloud on the left, yet the one on the left should be the chosen cloud.

The problem is that I don't know first hand how many outliers there will be or how far they are going to be from each other, so it's hard to define which points will "form a cloud".

The reason for my question is that I'm basically trying to center a 3d model in a computer application. Using just the densest area isn't valid, since it could represent just a small object that is modeled with high accuracy (and hence many points) due to being round.

1 Answer

Here is how I would group the points. Note that for $N$ points, this has $O(N)$ memory complexity; the time complexity depends on the implementation, but might be as bad as $O(N^2)$.

  1. Find the distance to the nearest other point for each point.

    This is the well-known nearest neighbor search problem. In this particular case, we essentially need the distance to the nearest neighbor for each point.

    One way to solve this is to spatially partition the volume the points reside in, using a regular cubic lattice, and arrange the data so that all points within a specific lattice cell are contiguous in memory. That way it suffices to check point pairs within the same cell, or in neighboring cells.

    (Note that just because point $i$ is closest to point $j$, does not mean $j$ is closest to $i$ too. Think of a long string of points, with distances slowly increasing, and you'll see why: each point is closest to the preceding point. Not realizing this is a common source of bugs.)
     

  2. Find the minimum distance $d$ at which points are no longer considered to belong to the same group.

    If you have $N$ points, and you assume there are $K$ solitary points, then $d$ is the $N-K$'th smallest nearest neighbor distance. (You can either sort the points, or use a selection algorithm, to find $d$ among the nearest neighbor distances computed in the previous step.)

    If you do not know $K$, examine the nearest neighbor distances: at distance $d$, there is a sharp jump to much larger distances. Looking at real-world data will typically tell you which method you can use to locate that discontinuity distance $d$.

    A simpler way is to just pick a fraction $\lambda$, $0 \lt \lambda \le 1$, and assume that at least a fraction $\lambda$ of all points belong to some group (i.e. that $K \lt N (1 - \lambda)$ points are solitary). Since $\lambda$ is a guess, you normally apply another factor $\gamma$ slightly larger than 1, so that $d = \gamma \, d_{\lceil \lambda \, N \rceil}$, where $d_n$ denotes the $n$'th smallest nearest neighbor distance. Essentially, you pick the $\lceil \lambda N \rceil$'th smallest nearest neighbor distance and multiply it by $\gamma$ to find the group-membership limit distance $d$. Personally, I'd start with $\lambda = 0.75$ and $\gamma = 1.25$, but let the user override the settings (either $\lambda$ and $\gamma$ used to find $d$, or $d$ directly) in the program configuration. This is a simple but robust method, although mathematically it is very much just handwaving.
     

  3. Create a disjoint-set data structure to represent the point groups.

    Initially, each point is in its own group. All pairs of points within distance $d$ of each other are merged (using the disjoint-set union operation; do apply path compression for efficiency), because they belong to the same group.

    If you have the points spatially partitioned (see step 1) into cells at least $d \times d \times d$ in size, you only need to examine the points within the same cell, and points in neighboring cells, to see if they should be joined.
     

  4. Apply full path compression to the disjoint-set data structure, so that points belonging to the same group will refer to the same root element.

    In other words, your disjoint-set data structure will have exactly one root element per point group. You can, in $O(N)$ time complexity, count the number of groups, and the number of points in each group, using the flattened (path-compressed) disjoint-set data structure.
     

  5. Profit!

    Err.. That is, now you know which point belongs to which group, how many points are in each group, and how many groups there are.

    If you selected $d$ too small, you'll have lots of groups near each other, and most likely a lot of solitary groups (with just one point). If you selected $d$ too large, you'll have a single group covering the entire set.
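The steps above can be sketched in Python. This is a minimal brute-force version (my own illustration, not code from the answer): it skips the spatial lattice from steps 1 and 3, so it runs in $O(N^2)$ time, and the names `group_points` and `center_of_largest_group` are made up here. The final function also adds the asker's last step: pick the group with the largest spatial extent, not the most points, and return its centroid.

```python
import math

def nearest_neighbor_distances(points):
    # Step 1, brute force: distance to the nearest other point, for each
    # point. The lattice partitioning described above would avoid the
    # O(N^2) pair checks used here for clarity.
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def grouping_distance(nn_dists, lam=0.75, gamma=1.25):
    # Step 2: d = gamma * (the ceil(lambda * N)'th smallest NN distance).
    ordered = sorted(nn_dists)
    return gamma * ordered[math.ceil(lam * len(ordered)) - 1]

class DisjointSet:
    # Steps 3 and 4: union-find with path compression.
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[i] != root:  # path compression
            self.parent[i], i = root, self.parent[i]
        return root
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def group_points(points, lam=0.75, gamma=1.25):
    # Merge every point pair closer than d, then collect the groups.
    d = grouping_distance(nearest_neighbor_distances(points), lam, gamma)
    ds = DisjointSet(len(points))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= d:
                ds.union(i, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(ds.find(i), []).append(i)
    return list(groups.values())

def center_of_largest_group(points, groups):
    # Choose the group with the largest bounding-box diagonal (spatial
    # extent, NOT point count) and return its centroid.
    def extent(group):
        coords = [points[i] for i in group]
        lo = [min(c[k] for c in coords) for k in range(len(coords[0]))]
        hi = [max(c[k] for c in coords) for k in range(len(coords[0]))]
        return math.dist(lo, hi)
    biggest = [points[i] for i in max(groups, key=extent)]
    return tuple(sum(c[k] for c in biggest) / len(biggest)
                 for k in range(len(biggest[0])))

# Example: a sparse 4x4 unit grid (16 points) versus a dense, far-away
# line (20 points). The grid wins despite having fewer points.
pts = [(float(i), float(j), 0.0) for i in range(4) for j in range(4)]
pts += [(50.0 + 0.1 * i, 50.0, 50.0) for i in range(20)]
groups = group_points(pts)
center = center_of_largest_group(pts, groups)  # -> (1.5, 1.5, 0.0)
```

With the default $\lambda = 0.75$ and $\gamma = 1.25$, the 27th smallest nearest-neighbor distance here is 1.0, so $d = 1.25$: large enough to connect the unit grid, far too small to bridge the two clusters.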

I have not implemented the above for computer graphics, but I do deal with molecular dynamics simulations using classical potential models (non-quantum-mechanical models), where one interesting sub-problem is cluster detection, which is similar.

If exact results are not needed, you can use a sub-volume for the first two steps; that is, only consider a subset of the points. Do note that you must pick them spatially (as opposed to, say, just taking the first few or some random points), so that you get a representative sample of the nearest neighbor distances. The third and fourth steps must be applied to all points, but they can be implemented in $O(N)$ time complexity in practice.
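As a sketch of that shortcut (again my own illustration; the name `estimate_d_in_subvolume` and the axis-aligned-box selection are assumptions): run steps 1 and 2 on only the points inside a spatially chosen sub-volume, then use the resulting $d$ for the full set.

```python
import math

def estimate_d_in_subvolume(points, lo, hi, lam=0.75, gamma=1.25):
    # Steps 1 and 2 restricted to the axis-aligned box [lo, hi]: a
    # spatially picked subset, so its nearest-neighbor distances stay
    # representative of the local point density.
    subset = [p for p in points
              if all(lo[k] <= p[k] <= hi[k] for k in range(len(lo)))]
    nn = [min(math.dist(p, q) for j, q in enumerate(subset) if j != i)
          for i, p in enumerate(subset)]
    nn.sort()
    return gamma * nn[math.ceil(lam * len(nn)) - 1]

# Example: a long line of unit-spaced points; estimating d from only
# the first stretch of the line still gives d = 1.25.
pts = [(float(i), 0.0, 0.0) for i in range(100)]
d = estimate_d_in_subvolume(pts, (0.0, -1.0, -1.0), (10.0, 1.0, 1.0))
```

The box must contain at least two points for the nearest-neighbor distances to be defined, and should straddle the dense region rather than the outliers.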