Probabilistic subsampling of an Erdős–Rényi graph

296 Views Asked by At

Suppose I have an Erdős–Rényi graph ${\cal G}(n,p)$, where $n$ is the total number of nodes and $p$ is the probability of an edge between any pair of nodes (edges are added independently). I subsample the graph by sampling $m$ nodes from the graph with probability proportional to the node degrees. Specifically, the probability of sampling node $i$ is $q_i = \frac{d_i}{\sum_{j=1}^n d_j}$, where $d_i, i=1, \ldots, n$ is the node degree for node $i$. We obtain a subgraph ${\cal G}_s$ consisting of the sampled nodes and whatever edges exist between them in the original graph. Can one say something concrete about the underlying distribution of such subgraphs, i.e., subgraphs obtained from by subsampling Erdős–Rényi graphs through the described subsampling process? I presume one may be able to write down an expression for $p_s$ (the probability that there exists an edge between a pair of nodes in the subgraph), but are the subgraphs still Erdős–Rényi? I believe the independence of edges between the nodes may be affected. If they are no longer Erdős–Rényi, then what is the underlying probabilistic model of such graphs? This seems like a natural question, but has anyone investigated similar questions before? Any pointers or references would help.

1

There are 1 best solutions below

6
On

The distribution of the subgraphs is different from $\mathcal G(n,p)$. One major issue is that you're much more likely to get "clone" vertices.

Consider $p=\frac12$, say. Here, for any $\epsilon>0$, with probability tending to $1$, all degrees are between $(1-\epsilon)\frac n2$ and $(1+\epsilon)\frac n2$. This means that $q_i$ stays between $(1-\epsilon)\frac1n$ and $(1+\epsilon)\frac1n$. So for any pair of vertices in the sampled graph $\mathcal G_s$, the pair of vertices they represent from the original random graph is almost equally likely (within a factor of $(1+\epsilon)^2$) to be any pair. In particular, $p_s$ is between $(1-2\epsilon)\frac12$ and $(1+2\epsilon)\frac12$.

In an $m$-vertex Erdős–Rényi graph with edge probability around $\frac12$, the probability that two vertices $v_i$ and $v_j$ have exactly the same set of adjacent vertices is around $2^{-m}$, and there are almost never any pairs with this property. In $\mathcal G_s$, this probability is at least $\frac1n$: a lower bound on the chance that the two vertices came from the same vertex of the original random graph. By the birthday paradox, if $m \gg \sqrt n$, you are going to get many such pairs.

In the case of sampling without replacement, the graph is very similar to an Erdős–Rényi graph when $p$ is large, by the same argument: we are not too much more likely to choose any vertex than any other. If we chose a random set of vertices $S$, we'd get an Erdős–Rényi graph on $S$, and our actual choice of $S$ is only slightly biased.

When $p$ is small, things look very different. For example, if $p = \frac1n$, then the average degree is $1$, but the highest degree is around $\log n$, and there are about $e^{-1}n$ vertices with degree $0$. So when sampling from this graph, we are going to pick up all or almost all of the vertices with high degrees, and none of the isolated vertices. (Of course, their degrees in $\mathcal G_s$ will be different.)

For an extreme case where this makes a huge difference, suppose we take $m = (1-e^{-1}-\epsilon)n$ for some $\epsilon>0$. Then $\mathcal G_s$ picks up almost all of the edges of the original graph, so it should be similar to a $\mathcal G(m, \frac cm)$ Erdős–Rényi graph, for $c = \frac{1}{1-e^{-1}}$. But compared to that graph, $\mathcal G_s$ has almost no isolated vertices: only $O(\epsilon n)$ of them.

I don't think there is a simple description of this random graph beyond the way you're constructing it (by taking a random graph and subsampling).