Taking the average after randomly knocking out members


I'm looking at a random variable that takes vectors $\newcommand{\vv}{\mathbf{v}} \vv_1, \dotsc, \vv_n \in \mathbb{R}^d$ and calculates their average, after applying "blankout" noise to them. So we have independent Bernoulli random variables $\xi_1, \dotsc, \xi_n$ with parameters $p(\xi_i = 1) = p_i$. I multiply the $\newcommand{\vv}{\mathbf{v}} \vv_i$ by the $\xi_i$ and take the average of the ones that remain:

$$\newcommand{\vv}{\mathbf{v}} \frac{\xi_1 \vv_1 + \dotsb + \xi_n \vv_n}{\xi_1 + \dotsb + \xi_n}.$$

One problem: this ratio is undefined in the trivial case where every $\xi_i$ is zero, since I'd be dividing by zero. I could fix that by adding a dummy term to the denominator that equals $1$ exactly when all the $\xi_i$ are zero:

$$\newcommand{\vv}{\mathbf{v}} \frac{\xi_1 \vv_1 + \dotsb + \xi_n \vv_n}{\xi_1 + \dotsb + \xi_n + \prod_i(1 - \xi_i)}.$$

  1. Is there a clean way to calculate the mean and variance of such a thing*? (We could leave the $\prod$ term out if we want, and add some other fix.)
  2. Is there a better (perhaps standard) way to accomplish this? For instance, use constants $a_i>0$ such that $\mathbb{E}(a_i\xi_i) = \frac1n$ (that is, $a_i = \frac{1}{np_i}$), and write it as

$$\texttt{Avg} \approx \newcommand{\vv}{\mathbf{v}} a_1\xi_1 \vv_1 + \dotsb + a_n \xi_n \vv_n?$$

(This thing would have easy mean/variance.)


*Here I mean the mean/variance taken over the joint distribution of the $\xi_i$.
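For intuition, here is a quick Monte Carlo sketch (the parameter values and variable names are my own illustration, not from any standard source) comparing the ratio estimator with the product fix against the linear estimator with $a_i = \frac{1}{np_i}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
v = rng.normal(size=(n, d))           # fixed vectors v_1, ..., v_n
p = rng.uniform(0.3, 0.9, size=n)     # Bernoulli parameters p_i
trials = 100_000

xi = (rng.random((trials, n)) < p).astype(float)   # xi[t, i] ~ Bernoulli(p_i)
k = xi.sum(axis=1)

# Ratio estimator with the product fix: the denominator is 1 when all xi_i = 0.
denom = np.where(k > 0, k, 1.0)
ratio_est = (xi @ v) / denom[:, None]

# Linear estimator with a_i = 1/(n p_i), so E[a_i xi_i v_i] = v_i / n.
a = 1.0 / (n * p)
linear_est = (xi * a) @ v

true_avg = v.mean(axis=0)
print("plain average:           ", true_avg)
print("E[ratio estimator] (MC): ", ratio_est.mean(axis=0))
print("E[linear estimator] (MC):", linear_est.mean(axis=0))
```

The simulation suggests the trade-off: the linear estimator is exactly unbiased for the plain average, while the ratio estimator concentrates near the $p$-weighted average $\sum_i p_i v_i / \sum_i p_i$ instead.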


BEST ANSWER

The "beautiful" results arise either when $p_i=p$ for all $i \in \{1, ..., n\}$ (as in my comment above), or when we allow heterogeneous $p_i$ values but let $n\rightarrow \infty$. Here is an answer for the case $n\rightarrow \infty$.


Claim:

Suppose that $\{\xi_i\}_{i=1}^{\infty}$ is a sequence of independent Bernoulli random variables with $p_i = P[\xi_i=1]$ for $i \in \{1, 2, 3, ...\}$, and $$ \liminf_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^n p_i > 0.$$ Suppose $\{v_i\}_{i=1}^{\infty}$ is a deterministic sequence of vectors in $\mathbb{R}^d$. Suppose all vectors are bounded, so there is a $B\geq 0$ such that $||v_i||\leq B$ for all $i \in \{1, 2, 3, ...\}$. Then

$$ \lim_{n\rightarrow\infty}\left| \frac{\sum_{i=1}^n\xi_iv_i}{\sum_{i=1}^n \xi_i}-\frac{\sum_{i=1}^{n}p_iv_i}{\sum_{i=1}^np_i} \right| = 0 \quad \mbox{(with prob 1)} $$ In particular, $$ \lim_{n\rightarrow\infty} \frac{\sum_{i=1}^n\xi_iv_i}{\sum_{i=1}^n \xi_i} = \lim_{n\rightarrow\infty} \frac{\sum_{i=1}^{n}p_iv_i}{\sum_{i=1}^np_i} \quad \mbox{(with prob 1)}$$ whenever the right-hand-side limit exists.
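A single-sample-path simulation illustrates the claim; the particular $p_i$ and $v_i$ sequences below are arbitrary choices of mine satisfying the hypotheses (heterogeneous $p_i$ bounded away from zero on average, bounded $v_i$):

```python
import numpy as np

# Illustration of the claim: along one sample path, the xi-weighted
# average approaches the p-weighted average as n grows.
rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    i = np.arange(1, n + 1)
    p = 0.2 + 0.6 * (i % 2)                       # p_i alternates 0.8, 0.2
    v = np.stack([np.sin(i), np.cos(i)], axis=1)  # bounded vectors in R^2
    xi = (rng.random(n) < p).astype(float)        # one Bernoulli draw per i
    gap = np.linalg.norm((xi @ v) / xi.sum() - (p @ v) / p.sum())
    print(f"n={n:>9}: |xi-average - p-average| = {gap:.5f}")
```

The gap shrinks roughly like $1/\sqrt{n}$, consistent with almost-sure convergence to zero.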


This can be proven with the help of the following lemma.

Lemma:

Let $\{Y_i\}_{i=1}^{\infty}$ be independent random variables that are surely bounded by some constant $b\geq 0$, so that $|Y_i|\leq b$ for all $i \in \{1, 2, 3, ...\}$. Then $$ \lim_{n\rightarrow\infty} \left|\frac{1}{n}\sum_{i=1}^n (Y_i-E[Y_i]) \right| = 0 \quad \mbox{(with prob 1)} $$
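The lemma is a strong law of large numbers for independent, uniformly bounded (not necessarily identically distributed) variables. A small numeric sketch, with $Y_i$ of my own choosing:

```python
import numpy as np

# Numeric sketch of the lemma: the centered sample mean of independent,
# uniformly bounded (here by b = 2), non-identically distributed
# variables shrinks toward 0.
rng = np.random.default_rng(2)
n = 1_000_000
b = 1 + (np.arange(n) % 2)      # per-variable bounds alternate between 1 and 2
y = rng.random(n) * b           # Y_i ~ Uniform[0, b_i], so E[Y_i] = b_i / 2
centered_mean = np.abs(np.mean(y - b / 2))
print(f"|(1/n) sum (Y_i - E[Y_i])| = {centered_mean:.6f}")
```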


Proof of claim:

Both $\{\xi_i\}_{i=1}^{\infty}$ and $\{\xi_i v_i\}_{i=1}^{\infty}$ are sequences of independent bounded variables and so by the lemma (applying the lemma component-wise for the case of the vector sequence $\{\xi_i v_i\}_{i=1}^{\infty}$): \begin{align} \epsilon_n &:= \frac{1}{n}\sum_{i=1}^n (\xi_iv_i - p_iv_i) \rightarrow 0 \quad \mbox{(with prob 1)} \\ \delta_n &:= \frac{1}{n}\sum_{i=1}^n (\xi_i-p_i) \rightarrow 0 \quad \mbox{(with prob 1)} \end{align} Hence \begin{align} \left| \frac{\sum_{i=1}^n\xi_iv_i}{\sum_{i=1}^n \xi_i}-\frac{\sum_{i=1}^{n}p_iv_i}{\sum_{i=1}^np_i} \right| &= \left|\frac{\frac{1}{n}\sum_{i=1}^n\xi_iv_i}{\frac{1}{n}\sum_{i=1}^n \xi_i}-\frac{\frac{1}{n}\sum_{i=1}^{n}p_iv_i}{\frac{1}{n}\sum_{i=1}^np_i} \right|\\ &=\left|\frac{\epsilon_n + \frac{1}{n}\sum_{i=1}^np_iv_i}{\delta_n + \frac{1}{n}\sum_{i=1}^n p_i}-\frac{\frac{1}{n}\sum_{i=1}^{n}p_iv_i}{\frac{1}{n}\sum_{i=1}^np_i} \right|\\ &=\frac{|\epsilon_n(\frac{1}{n}\sum_{i=1}^np_i) -\delta_n(\frac{1}{n}\sum_{i=1}^np_iv_i) |}{|(\delta_n + \frac{1}{n}\sum_{i=1}^n p_i)(\frac{1}{n}\sum_{i=1}^np_i)|} \rightarrow 0 \quad \mbox{(with prob 1)} \end{align} where we have used the fact that the denominator is positive and bounded away from zero (with prob 1), the terms $\frac{1}{n}\sum_{i=1}^np_i$ and $\frac{1}{n}\sum_{i=1}^np_iv_i$ are bounded, and $\epsilon_n\rightarrow 0$, $\delta_n\rightarrow 0$ with prob 1. $\Box$

ANSWER

For fixed $n$, assume $\{v_1, ..., v_n\}$ are vectors in $\mathbb{R}^d$ (for some positive integer $d$) with nonnegative components. Let $\{\xi_i\}_{i=1}^n$ be independent Bernoulli variables with $p_i=P[\xi_i=1]$, and define: $$V_n= \left\{ \begin{array}{ll} \frac{\sum_{i=1}^n \xi_i v_i}{\sum_{i=1}^n \xi_i}&\mbox{ if $\sum_{i=1}^n\xi_i>0$} \\ 0 & \mbox{ otherwise} \end{array} \right.$$ Define $K=\sum_{i=1}^n \xi_i$ and $K_i = \sum_{m\in \{1, ..., n\}, m \neq i} \xi_m$ for $i \in \{1, ..., n\}$. Then \begin{align} E[V_n] &= \sum_{k=1}^n E[V_n|K=k]P[K=k] \\ &= \sum_{k=1}^n \sum_{i=1}^n \frac{v_i}{k} E[\xi_i | K=k]P[K=k]\\ &= \sum_{k=1}^n \sum_{i=1}^n \frac{v_i}{k}P[\xi_i=1|K=k]P[K=k] \\ &= \sum_{k=1}^n \sum_{i=1}^n \frac{v_i}{k}P[K=k|\xi_i=1]p_i \\ &= \sum_{i=1}^n p_i v_i \left(\sum_{k=1}^n \frac{1}{k}P[K_i=k-1]\right)\\ &= \sum_{i=1}^n p_i v_i E\left[\frac{1}{1+K_i}\right] \\ &\geq \sum_{i=1}^n p_i v_i \frac{1}{1+E[K_i]} \quad \mbox{(Jensen's inequality)} \\ &= \sum_{i=1}^n \frac{p_i v_i}{1 + \sum_{m \in \{1, ..., n\}, m \neq i} p_m} \end{align}

The lower bound is interesting because it has a form similar to: $$ r_n := \frac{\sum_{i=1}^n p_iv_i}{\sum_{i=1}^n p_i}$$ and we know from the other answer that, under mild assumptions, $\lim_{n\rightarrow\infty} V_n = \lim_{n\rightarrow\infty}r_n$ with probability 1. In fact under those same mild assumptions, the lower bound also converges to $\lim_{n\rightarrow\infty} r_n$ and hence the lower bound is asymptotically exact.
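As a sanity check, the exact identity $E[V_n] = \sum_i p_i v_i E\left[\frac{1}{1+K_i}\right]$ and the Jensen lower bound can be verified by Monte Carlo. This is only a sketch; the instance sizes and distributions below are my own choices (with nonnegative $v_i$ components, as the derivation assumes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, trials = 8, 2, 500_000
v = rng.uniform(0, 1, size=(n, d))    # nonnegative components, as assumed
p = rng.uniform(0.2, 0.8, size=n)

xi = (rng.random((trials, n)) < p).astype(float)
k = xi.sum(axis=1)

# V_n = (sum xi_i v_i) / K when K > 0, and 0 otherwise.
V = np.where(k[:, None] > 0, xi @ v, 0.0) / np.maximum(k, 1.0)[:, None]
mc_mean = V.mean(axis=0)

# Identity: E[V_n] = sum_i p_i v_i E[1/(1+K_i)], with K_i = K - xi_i.
identity = np.zeros(d)
for i in range(n):
    ki = k - xi[:, i]
    identity += p[i] * v[i] * np.mean(1.0 / (1.0 + ki))

# Jensen lower bound: sum_i p_i v_i / (1 + sum_{m != i} p_m).
lower = sum(p[i] * v[i] / (1 + p.sum() - p[i]) for i in range(n))

print("E[V_n] (direct MC):       ", mc_mean)
print("sum p_i v_i E[1/(1+K_i)]: ", identity)
print("Jensen lower bound:       ", lower)
```

The two Monte Carlo estimates of $E[V_n]$ should agree, and both should dominate the lower bound componentwise (strictly, since $1/(1+x)$ is strictly convex and $K_i$ is nondegenerate).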