Given $K_{m,n}$ a bipartite graph, what is the expected fraction of overlap that the set of the top/largest m+n-1 edges would have with the maximum spanning tree (MST) of the graph? Naively it has to be greater than $\frac{1}{m+n-1}$ since the largest edge of the graph must be a part of both sets.
Likewise, are there any known $(\varepsilon,\delta)$ bounds on the overlap? I.e., any identities stating how large the set of top-$k$ edges would need to be to ensure that its fraction of overlap with the MST is greater than some $\varepsilon$ with a probability of $1-\delta$.
The cases I would be most interested in are if the edge weights are sampled from any continuous distribution.
Let's take Kruskal's algorithm, one of the algorithms for a maximum spanning tree. This considers the edges in order from largest to smallest weight, and for every edge, it adds it to the spanning tree provided this would not create a cycle.
Thinking about the maximum spanning tree of the randomly weighted graph in this way, the following questions are equivalent:
Rather than consider a complete bipartite graph, I will look at just a complete graph first, because this case is more well-studied: it is the Erdős–Rényi model of random graphs, where we take $n$ vertices and choose $m$ random edges between them. What is known is that:
In the maximum spanning tree world, this says:
If instead of a complete graph, we are weighting the edges of a complete bipartite graph, the general picture stays the same but the constants vary. I'll do a couple of examples; you can interpolate.
First suppose our host graph is $K_{n,2n}$. Let's switch to the model where we pick each edge independently with probability $p = \frac cn$. Then the expected number of cycles of length $2k$ is roughly $\frac1k n^k (2n)^k p^{2k} = \frac{(2c^2)^k}{k}$, and the sum $\sum_{k \ge 2}\frac{(2c^2)^k}{k}$ converges for $c < \frac{\sqrt2}{2}$. So the phase transition happens at $p = \frac{\sqrt 2}{2n}$, when we pick $\sqrt 2 n$ random edges.
Meanwhile, the probability that a vertex is isolated is $e^{-2np}$ for a vertex on the smaller side and $e^{-np}$ for a vertex on the larger side. When $p = \frac{c \log n}{n}$, this gives $n^{1-2c}$ isolated vertices on the smaller side and $2n^{1-c}$ isolated vertices on the larger side, so $c=1$ is the point at which almost all the isolated vertices on the larger side have disappeared. At this point, we have $2n \log n$ edges.
In summary:
It's also informative to consider the host graph $K_{10,n}$ where $10$ is fixed and $n$ is large. Here, we can be more precise. The vertices on the small side take a constant number of edges to connect, so we can ignore them. We just want to know: how many vertices on the large side are isolated?
By a Poisson approximation, the average number of such vertices after we pick $k$ edges is $n e^{-k/n}$. Therefore after $cn$ edges, there are $e^{-c}n$ isolated vertices, so approximately $(1 - e^{-c})n$ of the top $cn$ edges are part of the spanning tree.