Imagine this... Batman has just retrieved a tracking device he placed on The Joker 150 days ago. The good news is that it has 150 coordinates — one from each day. The bad news is that all the data is randomly sorted — there's no way to tell when the coordinates were recorded, nor their sequence. Further, all the data was collected at random times during the day so we can't even be sure any of the points were actually taken at the hideout — it might very well be in between some of them. How can we help Batman find the secret hideout?
Here's a map of the dataset: http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179
Here's a pastebin of the 150 raw geocodes: http://pastebin.com/grVsbgL9
In math terms, I'm looking for help identifying the centroid of a complex cluster of data. As you'll notice in this data set, there are several clusters (San Francisco, LA, Chicago and NYC) along with lots of noise throughout the rest. I need to determine which cluster is primary, and identify the centroid of this cluster.
Can you recommend a strategy? Preferably one with some meat I can use to begin analyzing the data for the "secret hideout"? ;)
Here's a heuristic that has no scientific basis whatsoever (as far as I know). Its virtue is that it's easy to program.
Let $d_{ij}$ be the distance from point $P_i$ to point $P_j$, $(1 \le i,j \le n)$.
(1) Compute the average $d$ of all the $n^2$ $d_{ij}$ values.
(2) Choose some factor $k$; I'd suggest around 0.1, but you can experiment.
(3) Let $r = k \cdot d$ be a "threshold" radius.
(4) For each $i$, find the count $c_i$ of other points that are within a distance $r$ from point $P_i$.
(5) Any point $P_i$ that has a high value for $c_i$ is a good candidate for the hide-out, because it has lots of other points nearby.
If you think you can guess a good value for $r$, then you can skip steps (1) through (3) and use your guess directly.
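The steps above can be sketched in Python as follows. This is only a rough illustration, not a tuned implementation. It assumes the points are (latitude, longitude) pairs in degrees and measures distance with the haversine (great-circle) formula, which is my choice and not something the heuristic requires. The names `haversine_km` and `neighbor_counts` are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to guard against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def neighbor_counts(points, k=0.1, r=None):
    """Return (counts, r): counts[i] is the number of other points within r of point i.

    If r is None, it is set to k times the average of all n^2 pairwise
    distances (steps 1 to 3); otherwise the supplied r is used as-is.
    """
    n = len(points)
    # Full n x n distance matrix, zeros on the diagonal included.
    d = [[haversine_km(points[i], points[j]) for j in range(n)] for i in range(n)]
    if r is None:
        mean_d = sum(map(sum, d)) / (n * n)  # step (1)
        r = k * mean_d                       # steps (2) and (3)
    # Step (4): count the other points within the threshold radius.
    counts = [sum(1 for j in range(n) if j != i and d[i][j] <= r)
              for i in range(n)]
    return counts, r
```

To apply step (5), call `counts, r = neighbor_counts(points)` and look at the points with the largest counts, for example `max(range(len(points)), key=counts.__getitem__)` for the single best candidate. With 150 points the full distance matrix is tiny, so there's no need for anything cleverer than the double loop.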