Suppose we are given $n$ objects (WLOG, $[n] = \{1, 2, 3, \ldots, n\}$) and two probabilities $p, q \in (0, 1)$. Starting from $X = [n]$ and $Y = \emptyset$, consider the following repeated process:
1. Sample each object in $X$ independently with probability $q$, and let $X_s$ be the sampled set.
2. Toss a coin: with probability $p$, set $Y = Y \cup X_s$; with probability $1 - p$, leave $Y$ unchanged.
3. Set $X = X \setminus X_s$.
4. Terminate if $X = \emptyset$; otherwise go back to step 1.
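For concreteness, the process can be sketched in Python (a minimal simulation; the function name and interface are my own):

```python
import random

def run_process(n, p, q, rng=random):
    """Run the process once and return |Y| at termination."""
    X = set(range(n))  # X = [n]
    Y = set()          # Y = empty set
    while X:
        # Step 1: sample each remaining object independently with probability q
        X_s = {i for i in X if rng.random() < q}
        # Step 2: one coin toss decides whether the whole sampled set joins Y
        if rng.random() < p:
            Y |= X_s
        # Step 3: remove the sampled objects from X
        X -= X_s
        # Step 4: the while-condition terminates the loop once X is empty
    return len(Y)

# Example: with q close to 1, almost surely every object is sampled in the
# first round, so |Y| is either 0 or n.
print(run_process(10, 0.5, 0.999))
```

Note that a single coin toss in step 2 covers the entire sampled set, which is what couples the indicator variables of different elements.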
We want to study the random variable $|Y|$ when the above process terminates. Two extreme cases, $q \to 0$ and $q \to 1$, are easy to analyze:
- When $q \to 0$, $|Y|$ is the sum of $n$ independent Bernoulli variables with parameter $p$, and thus $\mathbb{E}[|Y|] = np$ and $Var[|Y|] = np(1-p)$.
- When $q \to 1$, $|Y|$ is determined by a single coin flip: $\Pr[|Y| = n] = p$ and $\Pr[|Y| = 0] = 1 - p$, and thus $\mathbb{E}[|Y|] = np$ and $Var[|Y|] = n^2 p(1-p)$.
I can see that the expectation is not affected by $q$, while the variance intuitively increases with $q$, but I have not been able to derive a closed-form formula for the variance. Any help or advice would be highly appreciated.
Fact: $\Pr[i \in Y] = p$ for all $i \in [n]$. Indeed, element $i$ is removed from $X$ in exactly one round, and the coin tossed in that round puts it into $Y$ with probability $p$.
Let $Z_i$ denote the indicator variable of the event $i \in Y$. Then $\mathbb{E}[Z_i] = p$ and $Var[Z_i] = p(1-p)$ for all $i \in [n]$.
The probability that two elements $i, j \in [n]$ are determined together can be computed as follows. Condition on the first round in which at least one of $i, j$ is sampled: in that round, both are sampled with probability $$\frac{q^2}{1 - (1-q)^2} = \frac{q}{2-q}.$$ If they are removed in the same round, a single coin decides both, so $Z_i = Z_j$ and $\mathbb{E}[Z_i Z_j \mid \text{same round}] = p$; otherwise two independent coins decide them, giving $p^2$. Hence $$\mathbb{E}[Z_i Z_j] = \frac{q}{2-q} p + \frac{2-2q}{2-q} p^2.$$
Therefore, \begin{align} Var[|Y|] &= Var[\sum_i Z_i] \\ &= \sum_i Var[Z_i] + 2 \sum_{i < j} Cov[Z_i, Z_j] \\ &= np(1-p) + n(n-1) (\mathbb E[Z_i Z_j] - \mathbb E[Z_i] \mathbb E[Z_j]) \\ &= np(1-p) + n(n-1) (\frac{q}{2-q} p + \frac{2-2q}{2-q} p^2 - p^2) \\ &= np(1-p) + n(n-1) \frac{pq(1-p)}{2-q} \end{align}
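As a sanity check (names below are mine), a Monte Carlo estimate matches the closed form $np(1-p) + n(n-1)\frac{pq(1-p)}{2-q}$, which also recovers the two extremes: $np(1-p)$ as $q \to 0$ and $n^2 p(1-p)$ as $q \to 1$.

```python
import random
import statistics

def sample_Y_size(n, p, q, rng):
    """One run of the process; returns |Y| at termination."""
    X, y = set(range(n)), 0
    while X:
        X_s = {i for i in X if rng.random() < q}  # sample each element w.p. q
        if rng.random() < p:                      # one coin for the whole batch
            y += len(X_s)
        X -= X_s
    return y

def empirical_variance(n, p, q, trials=100_000, seed=0):
    rng = random.Random(seed)
    sizes = [sample_Y_size(n, p, q, rng) for _ in range(trials)]
    return statistics.pvariance(sizes)

n, p, q = 10, 0.3, 0.5
closed_form = n * p * (1 - p) + n * (n - 1) * p * q * (1 - p) / (2 - q)
print(closed_form)                 # ≈ 8.4
print(empirical_variance(n, p, q)) # should be close to the value above
```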