Sampling variance of edge density of subgraphs


I would like to evaluate the mean and variance of the edge density for subgraphs obtained by repeatedly subsampling nodes.

Specifically, suppose we have an undirected graph $G$ with $N$ vertices and adjacency matrix $y$. We obtain a subset of nodes $S$ by Bernoulli sampling: for each node $i$ in $G$, we sample an inclusion indicator $z_i\sim\mathsf{Bernoulli}(\rho)$ and add $i$ to $S$ if $z_i=1$. Here $\rho$ is the probability that any given node is included in the subgraph, and $n=\sum_{i=1}^N z_i$ is the number of nodes in the subgraph. The edge density of the induced subgraph $G'$ is then $$ \lambda=\frac{z^\intercal y z}{n(n-1)}, $$ where $z^\intercal$ denotes the transpose of $z$. Following Strand (1979), we let $\lambda=0$ for $n<2$. What are the mean and variance of $\lambda$ under this sampling scheme?
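For concreteness, here is a small Monte Carlo sketch of the sampling scheme. The graph, $\rho$, and the number of replications are arbitrary choices for illustration; the function just draws the indicators $z_i$, applies the $n<2$ convention, and computes $\lambda$ as defined above.

```python
import numpy as np

def subsample_density(y, rho, rng):
    """One Bernoulli node sample; returns lambda, with lambda = 0 when n < 2."""
    N = y.shape[0]
    z = (rng.random(N) < rho).astype(int)  # inclusion indicators z_i ~ Bernoulli(rho)
    n = z.sum()
    if n < 2:
        return 0.0
    return float(z @ y @ z) / (n * (n - 1))

# A small arbitrary undirected graph for illustration.
N = 6
y = np.zeros((N, N), dtype=int)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
for i, j in edges:
    y[i, j] = y[j, i] = 1

rng = np.random.default_rng(0)
rho = 0.9
samples = np.array([subsample_density(y, rho, rng) for _ in range(200_000)])
print(samples.mean(), samples.var())  # Monte Carlo estimates of E[lambda], Var[lambda]
```

This gives simulation-based answers to compare any closed-form expressions against.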

Background Reading

Frank (1977) (I couldn't find an open access link, unfortunately) derives the sampling variance for an estimator of the total number of edges in the original graph defined as $$\begin{aligned} \hat T&=\rho^{-2}\sum_{(i,j)\in S^2} y_{ij}\\ &=\frac{z^\intercal y z}{\rho^2} \end{aligned}$$ which is almost what I want. However, in the definition of $\lambda$, I divide by $n\sim\mathsf{Binomial}(N, \rho)$ rather than $\mathbb{E}[n]=\rho N$. If $N$ is sufficiently large and $\rho$ is moderate, this doesn't make much of a difference, but I have both small $N$ and $\rho$ close to 1.
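As a sanity check on the distinction, the sketch below (graph and parameters are arbitrary) verifies by Monte Carlo that $\hat T = z^\intercal y z / \rho^2$ is unbiased for $\sum_{i\ne j} y_{ij}$, which for a symmetric adjacency matrix counts each undirected edge twice; dividing by the random $n(n-1)$ instead, as in $\lambda$, is what breaks this simple unbiasedness.

```python
import numpy as np

# Arbitrary illustration: an 8-cycle.
N = 8
y = np.zeros((N, N), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (0, 7)]:
    y[i, j] = y[j, i] = 1

rng = np.random.default_rng(0)
rho = 0.6
estimates = []
for _ in range(100_000):
    z = (rng.random(N) < rho).astype(int)  # Bernoulli inclusion indicators
    estimates.append(float(z @ y @ z) / rho ** 2)  # Frank's estimator T_hat

print(np.mean(estimates), y.sum())  # both should be close to sum_{i != j} y_ij
```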

Strand (1979) derives the expectation and variance of a simple sample mean under Bernoulli sampling. That is, they consider the mean and variance of $$ \bar x = \frac{z^\intercal x}{\sum_{i=1}^N z_i}, $$ where $x$ is a vector of node attributes. This is, again, almost what I want, but the quadratic form $z^\intercal y z$ in the definition of $\lambda$ means I can't apply their expressions directly.

Any ideas much appreciated!

1 Answer

Here is a partial answer that only addresses the mean.

We can interpret $\lambda = \frac{z^{\mathsf T}yz}{n(n-1)}$ as the sum of sort-of-indicator variables $\lambda_{ij}$, for every $ij \in E(G)$, where $$\lambda_{ij} = \begin{cases}\frac2{k(k-1)} & \text{if $i$, $j$, and $k-2$ other nodes are sampled} \\ 0 & \text{otherwise.}\end{cases}$$ Then $\lambda_{ij} = \frac2{k(k-1)}$ with probability $\binom{N-2}{k-2} \rho^k (1-\rho)^{N-k}$, so its expected value is $$ \mathbb E[\lambda_{ij}] = \sum_{k=2}^N \frac2{k(k-1)} \binom{N-2}{k-2} \rho^k (1-\rho)^{N-k}. $$ Using the identity $\binom Nk = \frac{N(N-1)}{k(k-1)}\binom{N-2}{k-2}$, we can rewrite this as $$ \mathbb E[\lambda_{ij}] = \frac2{N(N-1)} \sum_{k=2}^N \binom Nk \rho^k (1-\rho)^{N-k} = \frac2{N(N-1)}\Pr[n \ge 2]. $$ Altogether, we get $\mathbb E[\lambda] = \frac{|E(G)|}{\binom N2} \Pr[n\ge 2]$, which is combinatorially nice: it's the edge density of $G$, multiplied by the probability that $\lambda$ has a sensible denominator.