Overlap between the maximum spanning tree and largest edges of a bipartite graph

192 Views Asked by At

Given $K_{m,n}$ a bipartite graph, what is the expected fraction of overlap that the set of the top/largest m+n-1 edges would have with the maximum spanning tree (MST) of the graph? Naively it has to be greater than $\frac{1}{m+n-1}$ since the largest edge of the graph must be a part of both sets.

Likewise, are there any known $(\varepsilon,\delta)$ bounds on the overlap? I.e., any identities stating how large the set of top-$k$ edges would need to be to ensure that its fraction of overlap with the MST is greater than some $\varepsilon$ with a probability of $1-\delta$.

The cases I would be most interested in are if the edge weights are sampled from any continuous distribution.

1

There are 1 best solutions below

0
On

Let's take Kruskal's algorithm, one of the algorithms for a maximum spanning tree. This considers the edges in order from largest to smallest weight, and for every edge, it adds it to the spanning tree provided this would not create a cycle.

Thinking about the maximum spanning tree of the randomly weighted graph in this way, the following questions are equivalent:

  • How many of the $k$ highest-weight edges of $G$ end up in the maximum spanning tree?
  • If we choose $k$ random edges of $G$, how many edges are in the largest acyclic graph we can form from them?

Rather than consider a complete bipartite graph, I will look at just a complete graph first, because this case is more well-studied: it is the Erdős–Rényi model of random graphs, where we take $n$ vertices and choose $m$ random edges between them. What is known is that:

  • Suppose we choose $m = cn$ edges for $c < \frac12$. Then with high probability they only form a constant number of cycles.
  • When $c>\frac12$, the graph undergoes a phase transition. The $m$ edges create a giant component with a constant fraction of the vertices, and many cycles. The remaining components are small, and mostly have no cycles, though a few have one. By a Poisson approximation, there are around $e^{-2c}n$ isolated vertices.
  • The graph becomes connected around $\frac12 n \ln n$ edges, when the last isolated vertex disappears.

In the maximum spanning tree world, this says:

  • The top $\frac12 n$ edges are almost all in the maximum spanning tree.
  • Of the top $n$ edges, some constant fraction is in the spanning tree (more than $\frac12$, by monotonicity, but less than $1 - e^{-2}$, because of the isolated vertices, which are part of the problem but not all of it). The exact constant is annoying.
  • The top $\frac12 n \ln n$ edges contain almost all the edges of the maximum spanning tree.

If instead of a complete graph, we are weighting the edges of a complete bipartite graph, the general picture stays the same but the constants vary. I'll do a couple of examples; you can interpolate.

First suppose our host graph is $K_{n,2n}$. Let's switch to the model where we pick each edge independently with probability $p = \frac cn$. Then the expected number of cycles of length $2k$ is roughly $\frac1k n^k (2n)^k p^{2k} = \frac{(2c^2)^k}{k}$, and the sum $\sum_{k \ge 2}\frac{(2c^2)^k}{k}$ converges for $c < \frac{\sqrt2}{2}$. So the phase transition happens at $p = \frac{\sqrt 2}{2n}$, when we pick $\sqrt 2 n$ random edges.

Meanwhile, the probability that a vertex is isolated is $e^{-2np}$ for a vertex on the smaller side and $e^{-np}$ for a vertex on the larger side. When $p = \frac{c \log n}{n}$, this gives $n^{1-2c}$ isolated vertices on the smaller side and $2n^{1-c}$ isolated vertices on the larger side, so $c=1$ is the point at which almost all the isolated vertices on the larger side have disappeared. At this point, we have $2n \log n$ edges.

In summary:

  • Of the top $\sqrt 2 n$ edges, almost all get used;
  • Of the top $3n$ edges, some constant fraction get used;
  • Almost all edges of the spanning tree are in the top $2n \log n$ edges.

It's also informative to consider the host graph $K_{10,n}$ where $10$ is fixed and $n$ is large. Here, we can be more precise. The vertices on the small side take a constant number of edges to connect, so we can ignore them. We just want to know: how many vertices on the large side are isolated?

By a Poisson approximation, the average number of such vertices after we pick $k$ edges is $n e^{-k/n}$. Therefore after $cn$ edges, there are $e^{-c}n$ isolated vertices, so approximately $(1 - e^{-c})n$ of the top $cn$ edges are part of the spanning tree.