Isolating clusters in data

31 Views Asked by At

This is a problem that I keep encountering in one form or another every so often.

Given a large N dimension space with a sufficiently large (M) sample of points. [1] If I were to be asked to find a single point which best summarised the distribution of the sample, the criteria for the single point would be easy. Just find the point where the rms of the distances between the point and each sample point is minimal.

However what I need is a technique/algorithm assuming the sample was generated as the union of a set of samples (call them subsamples), each sample clustered around a point. I then need someway to find the best set of subsamples and points to match the data.

In other words, I have a sample that is a collection of subsamples associated with a point and I need to identify the number of associated points, the actual points and seperate the sample into subsamples.

Any suggestions on how to do this.

PS I left some things vague purposely to make it as easy to answer as possible. Some constraints could make this problem a lot hard then it already is. (As in NP hard. )

[1] Though not large compared to the size of the original space. If we are talking for example of a lattice of a billion points, then I would expect a sample between 100000 and one million points.