I am testing a clustering algorithm in high dimensions. I want to see how it behaves as I allow the clusters to get closer and closer together, but it must work perfectly for "well separated" clusters. So I need to know how far apart to place the centers of those well-separated clusters, i.e., a minimum separation distance.
My data consists of randomly generated clusters: each point in a cluster equals the cluster's center plus a randomly generated noise vector. The noise is Gaussian, and each coordinate varies independently of the others.
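To make the setup concrete, here is a minimal sketch of the kind of generation I mean (the class and method names and the Box-Muller sampler are just for illustration, not my actual test code):

```csharp
using System;

class ClusterDataSketch
{
    static readonly Random Rng = new Random();

    // One standard-normal draw via Box-Muller (any Gaussian sampler would do).
    static double NextGaussian()
    {
        double u1 = 1.0 - Rng.NextDouble();          // value in (0, 1], avoids Log(0)
        double u2 = Rng.NextDouble();
        return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
    }

    // A cluster point is the cluster center plus independent N(0, S^2) noise in each coordinate.
    static double[] MakeClusterPoint(double[] center, double s)
    {
        var point = new double[center.Length];
        for (int d = 0; d < center.Length; d++)
            point[d] = center[d] + s * NextGaussian();
        return point;
    }

    static void Main()
    {
        var center = new double[50];                  // e.g. D = 50, center at the origin
        double[] p = MakeClusterPoint(center, 30.0);  // S = 30
        Console.WriteLine($"First coordinate of a sample point: {p[0]:F2}");
    }
}
```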
If the standard deviation of each coordinate's noise is S, the number of dimensions is D, and the number of points is N, at what distance L from the center is the farthest point likely to be? If I know that, then the minimum center-to-center separation that keeps the clusters from overlapping is 2*L.
Say I have one dimension. With N = 147,160 points, I can expect about 1 point to lie at a distance of 4.5 * S (four and a half standard deviations) from the center. For D dimensions, at what multiple of the standard deviation am I likely to find the farthest point? (I know from this article https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/chap1-high-dim-space.pdf that the expected value of the radius is

R = S * sqrt(D)

so the farthest point must be farther out than that.)
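For reference, the one-dimensional figure just comes from the normal tail: the chance of a single point landing beyond 4.5 standard deviations is

2 * (1 - Phi(4.5)) ≈ 6.8e-6 ≈ 1 / 147,160

where Phi is the standard normal CDF, so among 147,160 points I expect about one such point.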
What I need is:
L = f(N,S,D)
Experimentally (from a C# program), here is an example:
N = 53,723 points
S = 30
D = 50 dimensions
L = 262.02 (the largest distance from any point to the center of its own cluster)
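That L was measured by my test program. The sketch below is not that program, just a minimal standalone version of the same measurement: since a point's distance to its own center depends only on its noise vector, it draws N noise vectors with per-coordinate standard deviation S and reports the largest Euclidean norm.

```csharp
using System;

class FarthestPointSketch
{
    static void Main()
    {
        const int N = 53723;      // number of points
        const double S = 30.0;    // per-coordinate standard deviation
        const int D = 50;         // number of dimensions
        var rng = new Random();

        double maxDist = 0.0;
        for (int i = 0; i < N; i++)
        {
            // A point's distance to its own center depends only on its noise vector,
            // so the center can be treated as the origin here.
            double sumSq = 0.0;
            for (int d = 0; d < D; d++)
            {
                double u1 = 1.0 - rng.NextDouble();   // Box-Muller standard normal
                double u2 = rng.NextDouble();
                double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
                sumSq += (S * z) * (S * z);
            }
            maxDist = Math.Max(maxDist, Math.Sqrt(sumSq));
        }

        Console.WriteLine($"Farthest point from its center: {maxDist:F2}");
        Console.WriteLine($"Expected radius S * sqrt(D):    {S * Math.Sqrt(D):F2}");
    }
}
```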