Arrange points in a high-dimensional space when you only have distances (no coordinates)

130 Views Asked by At

I'm working on an algorithm to determine how similar two files are. You feed $2$ files into the algorithm and the algorithm returns a natural number. If this number is $0$, the files are identical. If they have similar contents, the number is small, if they are different, the result is a larger number (of the order of the length of the larger file).

I want to use this algorithm to compare many files (e.g. thousands of log files) to group them into clusters based on their similarities. But most clustering algorithms work with the coordinates of points in a multidimensional space. But I do not have coordinates. I have distances only.


1st question:
If the triangle inequality applies for each triple of distances (Let $a,b,c$ be the three distances, $a \le c \land b \le c$, then $a+b \ge c$): Is it then guaranteed that the $n$ points between which the distances are known can always be arranged in a ($n-1$)-dimensional space?

I would use the following algorithm:

  1. Sort the files (points) in any order, so you have Points $p_1, p_2, p_3$ etc.
  2. Set the point $p_1$ to the coordinate origin
  3. Set the point $p_2$ on the positive $x_1$ axis, corresponding to the distance $p_1-p_2$
  4. Place point $p_3$ in the plane defined by axes $x_1$ and $x_2$ so that the distances to the points $p_1$ and $p_2$ are correct and $p_3$ has a non-negative value for its $x_2$ coordinate.
  5. Continue with all further points so that the new point correctly takes all distances to the previous points and that it has a non-negative value for the newly added coordinate.

With my imagination I can understand that I can always use this algorithm to insert $4$ points with the correct distances into a $3$-dimensional space, and my gut feeling tells me that this is also true for all higher dimensions, but gut feeling is not a proof.
Is my assumption correct?


2nd question:
Same problem, but now the triangle inequality is violated. For example, among many thousands of points there are 3 points $A, B, C$ with these distances: $AB = 1, BC = 1000, AC = 1100$. (These are rounded results of a real test run.)

What possibilities do you have to arrange these points in a high-dimensional space? (The purpose is still to find clusters. The purpose is not accurate positioning.)


Addendum

I just posted an answer to my own question in which I showed that my assumption from question 1 was wrong.

But question 2 is still left unanswered, so let me ask it again with other words:

Given a complete graph with $n$ vertices and weights for all of its $\frac{n^2-n}{2}$ edges: What is the best way to embed this graph in an Euclidean space with $n-1$ dimensions, so that each vertex becomes a point and each pair of $2$ points has a distance as close as possible to the weight of the edge that connects both vertices?

1

There are 1 best solutions below

0
On

My assumption was wrong.

If you have a complete graph with more than 3 vertices, with positive weights for each edge, and if the triangle inequality applies for each triple of edges, then it is not always possible to embed this graph in Euclidean space, so that each vertex is a point and the weight of each edge is a distance.

Let $A,B,C,D$ be the vertices of the graph (points in Euclidean space) with these weights (distances):

$$ \overline{AB} = 5\\ \overline{AC} = 8\\ \overline{AD} = 5\\ \overline{BC} = 5\\ \overline{BD} = 8\\ \overline{CD} = 5\\ $$ Since the largest distance is less than twice as large as the smallest distance, the triangle inequality applies to all 4 triangles $ABC, ABD, ACD, BCD$. Let's begin to create the triangles $ABC$ and $ACD$. They are equal and they have a common edge ($AC$). The height of both triangles over their common edge is $3$. If you embed those triangles in $3D$ space (or any Euclidean space with more dimensions), the points $B$ and $D$ can freely rotate around the side $AC$ as long as they aren't connected. But their distance can never be greater that the sum of both heights, which is $6$. But $\overline{BD}$ should be $8$.

So, it is not possible to embed this graph with its weights in Euclidean space. Any larger complete graph might contain such an impossible tetrahedron as described here, and therefore for all complete graphs with more than 3 edges the triangle inequality is a criterion that is to weak to guarantee that the graph might be embedded in Euclidean space.

qed.