An ER network is a graph $G=(V,E)=\mathcal{G} (n, p)$, where there're $n$ nodes and for each two nodes $i,j\in V$, the edge $(i,j)$ has the probability $p$ of being present in $E$ and $(1-p)$ of being absent, independently of the other edges. We define that two different nodes $u$ and $v$ in network $G$ are 2-hop neighbors if and only if their shortest distance in $G$ is exactly 2. Provided that $p\ll 1 \ll np$. The question is, to prove the summation of expected number of 2-hop neighbors for all nodes in network $G$ can be approximated by $n^3p^2$.
Denote graph $G=(V,E)$, probability that node $i,j$ is connected as $P_{i,j}$, a node $i$'s $m$-hop neighbors set as $V_i^{(m)}$. First I'll show two seemingly right proofs leading to the expected result, which I think, however, is wrong. And then I'll put some of my thoughts about why there're mistakes.
- If two nodes $i$ and $j$ are 2-hop neighbors, there is a central node $k$ s.t. edge $(i,k)\in E$, $(k,j)\in E$ and edge $(i,j)\notin E$. To count pairs of 2-hop neighbors is exactly to count such central nodes. For each node $k$, the expected number of its 1-hop neighbors is $$ \mu(|V_k^{(1)}|)=\sum_{i\in V\backslash \{k\}}P_{ik}=(n-1)p\stackrel{np\gg p,\text{ so } n\gg 1}{\approx} np $$ Among the $np$ nodes there're ${np\choose 2}$ pairs of nodes. For each pair of nodes, that they're 2-hop is equivalent to that they're disconnected, with probability $(1-p)$. Thus the expected number of 2-hop neighbors (counted by nodes, 2$\times$pairs) is $$ 2\sum_{i,j\in V_k^{(1)}}(1-P_{ij})=2\frac{np(np-1)}{2}(1-p)\stackrel{p\ll 1 \ll np}{\approx} n^2p^2 $$ Globally each node of $V$ can be $k$, thus the summation of expected number of 2-hop neighbors for all nodes in network $G$ can be approximated by $n*n^2p^2=n^3p^2$.□
- Denote a 2-hop path as $(i,k,j)$. We have $k\in V_i^{(1)}$, $j\in V\backslash \{V_i^{(1)}\cup\{i\}\}$, and $(k,j)\in E$. Thus the expected number of such $(k,j)$ pairs is $$ \sum_{k\in V_i^{(1)}}\sum_{j\in V\backslash \{V_i^{(1)}\cup\{i\}\}}P_{kj}=np(n-np-1)p\stackrel{p\ll 1 \ll np}{\approx}n^2p^2. $$ Globally each node of $V$ can be $i$, thus the summation of expected number of 2-hop neighbors for all nodes in network $G$ can be approximated by $n*n^2p^2=n^3p^2$.□
However, it has been counted repeatedly in the above two proofs:
2-hop neighbors should've been counted in pairs (numbers of 2-hop paths) rather than in nodes (2$\times$numbers of 2-hop paths). Otherwise, e.g. $a,b$ and $b,c$ are both 2-hop neighbors. It's clearer to say "there're 2 pairs of neighbors" than to say "there're 4 neighbor nodes", with just 3 nodes in total.
Central nodes are also counted repeatedly. E.g. for a pair of 2-hop neighbors $i,j$, their 2-hop paths are $(i,k_1,j)$, $(i,k_2,j)$, ... ,$(i,k_m,j)$. $i,j$ should only be counted twice, but here they're counted $m$ times.
In fact, the expected number of all edges in $G$ is only $\mu(|E|)={n\choose 2}*p=\frac{n(n-1)}{2}p\approx \frac{n^2p}{2}$. Since $1\ll np$, $\frac{n^2p}{2}\ll n^3p^2$: how would 2-hop neighbors be far more than total edges?
One of my idea: randomly pick a pair of nodes $i,j\in V$, $$ \begin{aligned} Pr\{i,j\ are\ 2-hop\}&=Pr\{(i,j)\notin E\}*Pr\{\exists k, (i,k),(k,j)\in E\}\\ &=Pr\{(i,j)\notin E\}*(1-Pr\{\forall k, \mathbb{1}((i,k)\in E)*\mathbb{1}((k,j)\in E)=0\}\\ &=(1-p)*(1-(1-p^2)^{n-2}) \end{aligned} $$ but I don't know how to expand it to a well estimated form.
So I'm here to ask, is $n^3p^2$ the right answer? If yes, how to explain the mistakes I listed above and how to prove? If not, what's the exact answer?
Any possible help would be appreciated!
There's no reason why the number of 2-hop neighbors can't be much larger than the number of edges. For example, in a star graph ($1$ node connected to $k$ others), the number of edges is $k$, and the number of 2-hop neighbor pairs is $\binom k2$.
However, the answer of $n^3 p^2$ is only valid when $p$ is not too large. Specifically, we will want $np^2 \ll 1$, or $p \ll \frac1{\sqrt n}$. If $np^2 \gg 1$, then $n^3 p^2 \gg n^2$, so there would be more than $n^2$ 2-hop neighbor pairs, which is nonsense. The intermediate case where $p \sim \frac{c}{\sqrt n}$ also has different behavior: here, a constant fraction of the pairs of vertices are 2-hop neighbors.
Your final approach where we pick one of the $\binom n2$ pairs and estimate the probability that they form a $2$-hop neighbor pair is, I think, the easiest one conceptually, even if the asymptotics are tricky.
To understand the probability $p^* = (1-p)(1 - (1 - p^2)^{n-2})$, let's:
For $p \ll \frac1{\sqrt n}$, we now want to use the inequality $1 - \binom n1 p^2 \le (1 - p^2)^n \le 1 - \binom n1 p^2 + \binom n2 p^4$. Where does this come from? It's taking the first two and first three terms of the binomial expansion of $(1-x)^n$ as lower and upper bounds, which is valid by inclusion-exclusion. Therefore $$ np^2 - \frac12 n^2 p^4 \lesssim p^* \lesssim np^2. $$ However, $np^2 - \frac12 n^2p^4 = np^2 \left(1 - \frac12 np^2\right)$. We are assuming $np^2 \ll 1$, so $1 - \frac12 np^2 \sim 1$, and we have $p^* \sim np^2$.
There are $\binom n2 \sim \frac12 n^2$ pairs of vertices which can be 2-hop neighbors, so the expected number of 2-hop neighbors is $\binom n2 p^* \sim \frac12 n^3p^2$. This doubles, becoming $n^3 p^2$, if you want to count the pair $(v,w)$ and the pair $(w,v)$ as different.
For $p = \frac{c}{\sqrt n}$, $(1 - p^2)^n = (1 - \frac{c^2}{n})^n \sim e^{-c^2}$, so $p^* = 1 - e^{-c^2}$ and there are $\sim \binom n2 (1 - e^{-c^2})$ $2$-hop neighbors. By monotonicity, this is also the estimate when $p \sim \frac{c}{\sqrt n}$.
Finally, when $p \gg \frac1{\sqrt n}$ but still $p \ll 1$, we also have $p \gg \frac{c}{\sqrt n}$ for all $c$, so almost all pairs of vertices are 2-hop neighbors (since $1 - e^{-c^2} \to 1$ as $c \to \infty$).
You're right that you're multi-counting central nodes in your approaches. This is the reason why they always yield an estimate of $n^3p^2$, even though this estimate is false for $np^2 \gg 1$.
There's another thing you're not being careful about, which is multiplying expectations: in general, for random variables $X$ and $Y$, $\mathbb E[X Y] \ne \mathbb E[X] \mathbb E[Y]$.
You make this mistake in both approaches; it's easiest to spot in the first. There, if $X$ is the number of neighbors of a node, you compute $\mathbb E[X] \sim np$. Then, you switch to talking about $\binom X2$, the number of pairs of neighbors. You claim that its average value is $\mathbb E \left[ \binom X2\right] \sim \binom {np}2$; however, the only thing we get for free is $\binom{\mathbb E[X]}{2} \sim \binom{np}{2}$, which is different.
For example, if a node is equally likely to have $0$ and $100$ neighbors, then $\mathbb E[X] = 50$, so $\binom{\mathbb E[X]}{2} = 1225$. However, $\binom X2$ is either $0$ or $4950$, so $\mathbb E \left[ \binom X2\right] = 2475$; more than twice as large.
You either need to compute $\mathbb E[X^2]$ directly, or you need to show that $X$ is tightly concentrated around its mean. Both of these take more work.