If I have a set of leaves, and a distance matrix, how can I determine there is no way to create a tree graph for the leaves?
Asked another way, is there always a possible tree graph for any distance matrix on leaves?
My intuition is that in some way fitting non-tree leaves to a tree will result in violating the triangle inequality, but I cannot figure out why.
Deliverable: a general procedure to either A) create a tree for every distance matrix, or B) identify a distance matrix that cannot realize a tree.
There's a couple of ways to ask the problem: you could ask that each edge in the tree have length $1$, or you could allow weights on the edges. These have different answers in a couple of cases: for example, the distance matrix where any two leaves should be at distance $1$ is not realizable in the first model (you'd have to build a complete graph) but it is realizable in the second (build a star in which every edge has weight $\frac12$). But the procedure to solve both versions is basically the same.
First, for an example of a distance matrix that's not realizable for the leaves of any tree: $$ \begin{bmatrix} 0 & 2 & 2 & 3 \\ 2 & 0 & 2 & 4 \\ 2 & 2 & 0 & 5 \\ 3 & 4 & 5 & 0 \end{bmatrix} $$ Here, the subtree containing vertices $1$, $2$, $3$, and the paths between them must consist of three paths from a root $r$ to each of these three vertices, with the $r,1$-path, the $r,2$-path, and the $r,3$-path having total weight $1$. The path from vertex $4$ to this subtree must meet it on some $r,i$-path for $i = 1, 2,3$. In that case, the distances from vertex $4$ to the other two leaves except for $i$ are the same. But in the distance matrix, all three distances from vertex $4$ are equal.
Let $D$ be the distance matrix, with $D_{ij}$ the required distance from leaf $i$ to leave $j$. If a tree realization of $D$ exists, we can construct it recursively: find a realization $T_{n-1}$ for the first $n-1$ leaves, then find a place to add on the $n^{\text{th}}$ leaf.
Here is how we can do that. If $v$ is a vertex of $T_{n-1}$ for leaves $1, \dots, n-1$, then we can consider the possibility that in the full tree $T_n$, the path from leaf $n$ to $T_{n-1}$ meets it at $v$. For this to happen and to agree with the matrix $D$, the length of that path should be $D_{1n} - d(1,v)$ (where $d(1,v)$ is the distance from $1$ to $v$ in $T_{n-1}$) but also $D_{2n} - d(2,v)$, $D_{3n} - d(3,v)$, and so on. If these differences all have the same (positive) value, then we can construct the new path by adding a path from $v$ to $n$ of that length.
In the weighted case, there is another way of adding leaf $n$ to $T_{n-1}$: we subdivide an edge of weight $w$ into two edges of weights $w_1, w_2$ where $w_1 + w_2 = w$, and add a path from $n$ to the vertex where those two edges meet. This is handled similarly: for an edge we're considering, make a hypothetical subdivision of this form, and compute the differences $D_{in} - d(i,v)$ in terms of $w_1$ and $w_2$. If these can be made equal by some valid choice of $w_1$ and $w_2$, do so.
The key point to make this algorithm work is that there is at most one valid choice of vertex $v$, and so there is at most one valid realization $T_i$ for $i=1, 2, \dots, n$. If there were ever multiple realizations, this strategy would be much less efficient, because we might have to backtrack to try all of them.
To prove this point, suppose that there were two ways to extend $T_{n-1}$ to include leaf $n$, and they involved attaching a path to vertices $v_1$ and $v_2$. We find leaves $i_1$ and $i_2$ such that going from $i_1$ to $i_2$ involves passing $v_1$ and then $v_2$. (To do this, let $i_1$ be a leaf in the component of $T_{n-1} - v_1$ not containing $v_2$, and vice versa.) Then:
So $v_1$ and $v_2$ can't both be valid attachment points, and therefore there can't be two valid attachment points. Either there is one, or there are none at all.