Variance of "grouped" summation of Bernoulli variables


Given $n$ objects (WLOG, $[n] = \{1, 2, 3, \ldots, n\}$) and two probabilities $p, q \in (0, 1)$, start from $X = [n]$ and $Y = \emptyset$ and consider the following repeated process:

  1. Sample each object of $X$ independently with probability $q$, and let $X_s$ be the sampled set
  2. Toss a coin: with probability $p$, set $Y = Y \cup X_s$; with probability $1 - p$, leave $Y$ unchanged
  3. Set $X = X \setminus X_s$
  4. Terminate if $X = \emptyset$; otherwise go back to 1.
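For concreteness, the process is easy to simulate directly. The sketch below is ours (the function name `simulate_Y` is an assumption, not from the source); since the objects are interchangeable, only the counts $|X|$ and $|Y|$ need to be tracked:

```python
import random

def simulate_Y(n, p, q, rng=random):
    """Run the grouped-sampling process once and return |Y|."""
    remaining = n   # |X|; objects are interchangeable, so only counts matter
    y = 0           # |Y|
    while remaining > 0:
        # Step 1: sample each remaining object independently with probability q.
        sampled = sum(rng.random() < q for _ in range(remaining))
        # Step 2: a single coin toss decides whether the whole batch joins Y.
        if sampled > 0 and rng.random() < p:
            y += sampled
        # Step 3: remove the sampled objects; step 4: loop until X is empty.
        remaining -= sampled
    return y
```

Averaging `simulate_Y` over many runs gives a quick empirical handle on $\mathbb{E}[|Y|]$ and $\mathrm{Var}[|Y|]$ for any $q$.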

We want to study the random variable $|Y|$ when the above process terminates. Two extreme cases, $q \to 0$ and $q \to 1$, are easy to analyze.

  • When $q \to 0$, each object is effectively decided by its own coin toss, so $|Y|$ is the sum of $n$ independent Bernoulli($p$) variables; thus $\mathbb{E}[|Y|] = np$ and $\mathrm{Var}[|Y|] = np(1-p)$
  • When $q \to 1$, $|Y|$ is determined by a single coin toss, with $\Pr[|Y| = n] = p$ and $\Pr[|Y| = 0] = 1 - p$; thus $\mathbb{E}[|Y|] = np$ and $\mathrm{Var}[|Y|] = n^2 p(1-p)$

I can see that the expectation is not affected by $q$, while the variance intuitively increases with $q$, but I have not been able to derive a closed-form formula for the variance. Any help or advice would be highly appreciated.


BEST ANSWER

Fact: $\Pr[i \in Y] = p$ for every $i \in [n]$.

Let $Z_i$ denote the indicator variable of the event $i \in Y$. Then $\mathbb{E}[Z_i] = p$ and $\mathrm{Var}[Z_i] = p(1-p)$ for every $i \in [n]$.

The probability that two elements $i, j \in [n]$ are decided in the same round (and hence share one coin toss) can be computed as follows:

  • Condition on the first round in which at least one of $i$ and $j$ is sampled; there are three cases: (1) $i$ but not $j$, (2) $j$ but not $i$, and (3) both $i$ and $j$
  • $\Pr[\text{both $i$ and $j$} \mid \text{at least one}] = \frac{q^2}{2q(1-q) + q^2} = \frac{q}{2-q}$

If $i$ and $j$ are decided in the same round they share one coin toss, so both end up in $Y$ with probability $p$; otherwise their coin tosses are independent, and both end up in $Y$ with probability $p^2$. Therefore, \begin{align} \mathrm{Var}[|Y|] &= \mathrm{Var}\Big[\sum_i Z_i\Big] \\ &= \sum_i \mathrm{Var}[Z_i] + 2 \sum_{i < j} \mathrm{Cov}[Z_i, Z_j] \\ &= np(1-p) + n(n-1) \big(\mathbb{E}[Z_i Z_j] - \mathbb{E}[Z_i]\,\mathbb{E}[Z_j]\big) \\ &= np(1-p) + n(n-1) \left(\frac{q}{2-q}\, p + \frac{2-2q}{2-q}\, p^2 - p^2\right) \\ &= np(1-p) + n(n-1)\, \frac{pq(1-p)}{2-q}. \end{align} As a sanity check, $q \to 0$ recovers $np(1-p)$ and $q \to 1$ recovers $n^2 p(1-p)$.
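The closed form can also be sanity-checked against a Monte Carlo estimate. The sketch below is ours (the helper names `simulate_Y` and `variance_closed_form` are assumptions); it re-implements the process and compares the empirical variance with $np(1-p) + n(n-1)\,pq(1-p)/(2-q)$:

```python
import random

def simulate_Y(n, p, q, rng):
    # One run of the process; only the counts |X| and |Y| matter.
    remaining, y = n, 0
    while remaining > 0:
        sampled = sum(rng.random() < q for _ in range(remaining))
        if sampled > 0 and rng.random() < p:
            y += sampled
        remaining -= sampled
    return y

def variance_closed_form(n, p, q):
    # Var[|Y|] = np(1-p) + n(n-1) * pq(1-p) / (2-q)
    return n * p * (1 - p) + n * (n - 1) * p * q * (1 - p) / (2 - q)

rng = random.Random(1)
n, p, q, trials = 8, 0.4, 0.6, 50000
samples = [simulate_Y(n, p, q, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
```

With these (arbitrary) parameters the empirical mean should sit near $np = 3.2$ and the empirical variance near the closed-form value $\approx 7.68$.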

ANOTHER ANSWER

Denote by $X_t$ and $Y_t$ the sizes of the sets $X$ and $Y$ at time $t$. Also write $p' = 1-p$ and $q' = 1-q$. If I understand correctly, $X_t$ is a Markov chain such that $X_{t+1}$ conditioned on $X_t$ has the binomial distribution $B(q', X_t)$. Also $Y_{t+1} = Y_t + \epsilon_{t+1} X_{t+1}$, where $(\epsilon_t)_{t \geq 0}$ is a sequence of i.i.d. random variables such that $\Pr(\epsilon_t = 1) = p$ and $\Pr(\epsilon_t = 0) = p'$. As a consequence, $$Y_t = \sum_{k=0}^{t} \epsilon_k X_k,$$ and you are asking for the distribution of $|Y| = \lim_{t} Y_t = \sum_{k=0}^{\infty} \epsilon_k X_k$ in the particular case where $X_0 = n$. To this end, let us first show that $$\mathbb{E}(z_0^{X_0} \cdots z_t^{X_t} \mid X_0) = \left((q')^{t} z_0 z_1 \cdots z_t + q z_0 \sum_{k=0}^{t-1} (q')^{k} z_1 \cdots z_k\right)^{X_0} = A_t(z_0, \ldots, z_t)^{X_0} = A_t^{X_0}. \ \ (*)$$ We prove $(*)$ by induction, introducing $X_{-1}$ such that $X_0 \sim B(q', X_{-1})$ and observing that $$\mathbb{E}(z_{-1}^{X_{-1}} z_0^{X_0} \cdots z_t^{X_t} \mid X_{-1}) = \mathbb{E}\left(z_{-1}^{X_{-1}} \, \mathbb{E}(z_0^{X_0} \cdots z_t^{X_t} \mid X_0) \,\middle|\, X_{-1}\right) = \mathbb{E}\left(z_{-1}^{X_{-1}} A_t^{X_0} \,\middle|\, X_{-1}\right)$$ $$= z_{-1}^{X_{-1}} (q' A_t + q)^{X_{-1}} = A_{t+1}(z_{-1}, z_0, \ldots, z_t)^{X_{-1}},$$ where the last equality uses $\mathbb{E}(w^{X_0} \mid X_{-1}) = (q' w + q)^{X_{-1}}$ for $X_0 \sim B(q', X_{-1})$. Since $(*)$ holds for $t = 0$, the induction is complete.

Finally we prove that $$\mathbb{E}(z^{Y_t} \mid X_0) = A_t(p'+pz, \ldots, p'+pz)^{X_0}.$$ To see this, condition on the sequence $(\epsilon_t)_{t \geq 0}$: $$\mathbb{E}(z^{Y_t} \mid \epsilon, X_0) = A_t(z^{\epsilon_0}, \ldots, z^{\epsilon_t})^{X_0}.$$ Since $A_t$ is affine in each of its arguments and $\mathbb{E}(z^{\epsilon_k}) = p' + pz$, averaging over the independent $\epsilon_k$ gives $$\mathbb{E}(z^{Y_t} \mid X_0) = A_t(p'+pz, \ldots, p'+pz)^{X_0}.$$ Now, for $|z| \leq 1$, since $\lim_t A_t(z, \ldots, z) = \frac{qz}{1-q'z}$, taking $X_0 = n$ yields $$\mathbb{E}(z^{|Y|}) = \left(\frac{q(p'+pz)}{1-q'(p'+pz)}\right)^{n}.$$
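The algebra behind $(*)$ can be checked numerically. The sketch below is ours (the concrete value $q = 0.3$ and the helper `A` are assumptions for illustration): it verifies the induction step $A_{t+1}(w, z_0, \ldots, z_t) = w\,(q' A_t(z_0, \ldots, z_t) + q)$ at random arguments, and the geometric-series limit $A_t(z, \ldots, z) \to \frac{qz}{1-q'z}$ for $|q'z| < 1$:

```python
from math import prod
import random

Q = 0.3       # q (arbitrary illustrative value)
QP = 1 - Q    # q'

def A(zs):
    # A_t(z_0,...,z_t) = q'^t z_0···z_t + q z_0 Σ_{k=0}^{t-1} q'^k z_1···z_k
    t = len(zs) - 1
    return QP**t * prod(zs) + Q * zs[0] * sum(
        QP**k * prod(zs[1 : k + 1]) for k in range(t))

rng = random.Random(0)
zs = [rng.uniform(0.1, 0.9) for _ in range(5)]
w = rng.uniform(0.1, 0.9)

# Induction step of (*): A_{t+1}(w, z_0,...,z_t) = w (q' A_t(z_0,...,z_t) + q).
assert abs(A([w] + zs) - w * (QP * A(zs) + Q)) < 1e-9

# With equal arguments, A_t(z,...,z) = (q'z)^t z + q z (1-(q'z)^t)/(1-q'z),
# which tends to qz/(1-q'z) as t → ∞ when |q'z| < 1.
z, t = 0.8, 200
assert abs(A([z] * (t + 1)) - Q * z / (1 - QP * z)) < 1e-9
```

Both assertions pass, supporting the induction step and the limiting form used in the final generating function.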