How to take into account uncertainty on number of events


Suppose I generate a set of events $X_{i}$ for $i = 1, 2, \dots, N$, and suppose every event is either a success or a failure, i.e. $X_{i} \in \{0, 1\}$. If $N$ is fixed, the MLE for the probability of success is just $$\hat{p} = \frac{1}{N}\sum_{i = 1}^{N} X_{i}$$ and the variance of the MLE can be estimated as $$V(\hat{p}) = \frac{\hat{p}(1 - \hat{p})}{N}.$$ But now suppose that $$N \sim \text{Poisson}(\lambda_N),$$ where $\lambda_{N}$ is fairly large, so if we need to, this could be approximated as $$N \sim \text{Normal}\left(\mu = \lambda_{N}, \sigma^2 = \lambda_{N}\right).$$ The MLE is still just $$\hat{p} = \frac{1}{N}\sum_{i = 1}^{N} X_{i},$$ but the variance is increased. What I want to know is how to calculate the new variance of $\hat{p}$, taking into account the uncertainty on $N$. I tried some error propagation on $N$, but I can't quite reproduce my numerical results.
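For concreteness, the effect can be seen in a quick R simulation (a sketch; the values $p = 0.3$ and $\lambda_N = 50$ are illustrative, not from the question). Iterations with $N = 0$ are dropped, since $\hat{p}$ is undefined there.

```r
# Simulated sampling distribution of p-hat when N ~ Poisson(lambda_N).
# Illustrative values: p = 0.3, lambda_N = 50.
set.seed(1)
m <- 1e5; p <- 0.3; lam <- 50
phat <- replicate(m, {
  n <- rpois(1, lam)
  if (n == 0) NA_real_ else mean(rbinom(n, 1, p))  # drop N = 0 cases
})
var(phat, na.rm = TRUE)   # empirical variance of p-hat
p * (1 - p) / lam         # fixed-N formula, for comparison
```

The empirical variance comes out slightly larger than the fixed-$N$ value $p(1-p)/\lambda_N$, which is the excess the question asks about.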

BEST ANSWER

The formula for the variance of a 'random sum of random variables' is given in many probability texts. It is derived by conditioning on $N$, as suggested by @Augustin. If $X_1, \dots, X_N$ are iid and $N$ is independent of the $X$'s, then the sum $S = \sum_{i=1}^N X_i$ has variance $$V(S) = E(N)V(X) + V(N)[E(X)]^2.$$ Roughly speaking, the second term expresses the additional variability due to random $N$ as compared with fixed $n$. As you might expect, $E(S) = E(N)E(X).$

Notice the units in the formula for $V(S).$ Terms for $N$ are pure numbers. Terms for $X$ have squared units to match units of $S$.

In many cases, a normal approximation for $S$ gives good results. If the $X$'s are binomial, as in your case, normal approximation should work fairly well if $E(N)$ is moderately large. (But I would not trust it if $X$'s are extremely skewed with long tails, as for exponential, unless $E(N)$ is large.)

In practical cases, simulation is an easily programmed option--if only as a check on the accuracy of a normal approximation. (Depending on the language used, you might have to make explicit accommodation for cases where $N = 0$.)

Here is a simulation in R of 100,000 experiments with $N \sim \text{Pois}(10)$ and $X_i \sim \text{Binom}(1, 1/2) = \text{Bernoulli}(1/2).$ As the histogram shows, the normal fit is not excellent because the Poisson mean is relatively small.

NOTE: The dark dots atop the histogram bars are for $Pois(5)$, which is the exact distribution of $S$ in this particular case. Here $S$ is simply $N$ with half of its realizations filtered out at random. This result can be established analytically using moment generating functions. (Imagine thin lead foil blocking half of the particles emitted from a chunk of uranium. What gets through is still Poisson.)

 m = 10^5;  s = numeric(m)
 for (i in 1:m) {
   n = rpois(1, 10)
   s[i] = sum(rbinom(n, 1, .5)) }
 mean(s);  sd(s)
 ##  4.99118  # approx E(S) = 5
 ##  2.231150 # approx SD(S), where V(S) = 5
 sqrt(10*.25 + 10*.25)
 ##  2.236068 # exact SD(S)

 mean(s < 5)
 ## 0.4425    # simulated P(S < 5)
 pnorm(4.5, 5, sqrt(5))
 ## 0.4115316 # norm approx P(S < 5)
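The Poisson-thinning claim in the NOTE above can also be checked numerically. Here is a self-contained sketch that re-runs the simulation and compares the empirical distribution of $S$ with the $\text{Pois}(5)$ pmf:

```r
# Check that S, a sum of Bernoulli(1/2) over N ~ Pois(10) draws,
# is distributed as Pois(5): compare empirical pmf with dpois(k, 5).
set.seed(42)
m <- 1e5
s <- replicate(m, sum(rbinom(rpois(1, 10), 1, 0.5)))
emp <- tabulate(s + 1, nbins = 16) / m   # empirical P(S = 0), ..., P(S = 15)
max(abs(emp - dpois(0:15, 5)))           # small if the thinning claim holds
```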

[Figure: histogram of simulated $S$ with the normal approximation overlaid and exact $\text{Pois}(5)$ points on the bars]

ADDENDUM: The OP seems interested in the distribution of $S/N$ (see Comments). Here is a histogram from code slightly altered to retain the values of both $S$ and $N$ at each iteration.

[Figure: histogram of simulated $S/N$]
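A minimal version of that altered simulation might look as follows (a sketch; iterations with $N = 0$ are dropped, since $S/N$ is undefined there):

```r
# Same simulation, but keeping both S and N so that S/N can be examined.
set.seed(7)
m <- 1e5
n <- rpois(m, 10)
s <- sapply(n, function(k) sum(rbinom(k, 1, 0.5)))
keep <- n > 0                 # S/N undefined when N = 0
ratio <- s[keep] / n[keep]
mean(ratio); sd(ratio)        # mean should be near 1/2
hist(ratio, breaks = 30)
```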


Variance of the Sum of $\boldsymbol{n}$ Items Where $\boldsymbol{n}$ is Variable

To complete the discussion, I think it would be nice to include a derivation of the formula for the variance when $n$ is variable, for those who do not have access to an appropriate text. $\newcommand{\E}{\operatorname{E}}\newcommand{\Var}{\operatorname{Var}}$

For a fixed $n$, we have $$ \E\!\left[\sum_{k=1}^nX_k\right]=n\E[X]\tag{1} $$ and, using $\Var[X]=\E\!\left[X^2\right]-\E[X]^2$, we get $$ \begin{align} \E\left[\left(\sum_{k=1}^nX_k\right)^2\right] &=n\E\!\left[X^2\right]+(n^2-n)\E[X]^2\\ &=n\Var[X]+n^2\E[X]^2\tag{2} \end{align} $$ Taking the expectation of $(1)$ in $n$ yields $$ \E\!\left[\sum_{k=1}^nX_k\right]=\E[n]\E[X]\tag{3} $$ and, using the fact that $\E\!\left[n^2\right]=\Var[n]+\E[n]^2$, taking the expectation of $(2)$ in $n$ gives $$ \begin{align} \E\left[\left(\sum_{k=1}^nX_k\right)^2\right] &=\E[n]\Var[X]+\E\!\left[n^2\right]\E[X]^2\\ &=\E[n]\Var[X]+\Var[n]\E[X]^2+\E[n]^2\E[X]^2\tag{4} \end{align} $$ Thus, $$ \begin{align} \mathrm{Var}\left[\sum_{k=1}^nX_k\right] &=\E\left[\left(\sum_{k=1}^nX_k\right)^2\right]-\E\!\left[\sum_{k=1}^nX_k\right]^2\\[6pt] &=\bbox[5px,border:2px solid #C0A000]{\E[n]\Var[X]+\Var[n]\E[X]^2}\tag{5} \end{align} $$
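Formula $(5)$ can be verified numerically for a non-Bernoulli case as well. A sketch with the illustrative choice $n \sim \text{Pois}(20)$ and $X \sim \text{Exp}(1)$, so that $\E[n]\Var[X] + \Var[n]\E[X]^2 = 20 \cdot 1 + 20 \cdot 1 = 40$:

```r
# Numerical check of formula (5): n ~ Pois(20), X ~ Exp(1),
# predicted Var[S] = E[n]*Var[X] + Var[n]*E[X]^2 = 20 + 20 = 40.
set.seed(3)
m <- 1e5
s <- replicate(m, sum(rexp(rpois(1, 20), rate = 1)))  # sum of 0 draws is 0
mean(s)   # should be near E[n]*E[X] = 20
var(s)    # should be near 40
```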


Variance of the Mean of $\boldsymbol{n}$ Items Where $\boldsymbol{n}$ is Variable

I think I misunderstood the question before the last edit. It appears that you are looking for the variance of $\frac1n\sum\limits_{k=1}^nX_k$ where $n$ is variable. If that is the case, we can follow the same approach as above:

For a fixed $n$, we have $$ \E\!\left[\frac1n\sum_{k=1}^nX_k\right]=\E[X]\tag{6} $$ and, using $\Var[X]=\E\!\left[X^2\right]-\E[X]^2$, we get $$ \begin{align} \E\left[\left(\frac1n\sum_{k=1}^nX_k\right)^2\right] &=\frac1n\E\!\left[X^2\right]+\left(1-\frac1n\right)\E[X]^2\\ &=\frac1n\Var[X]+\E[X]^2\tag{7} \end{align} $$ Taking the expectation of $(6)$ in $n$ yields $$ \E\!\left[\frac1n\sum_{k=1}^nX_k\right]=\E[X]\tag{8} $$ and taking the expectation of $(7)$ in $n$ gives $$ \E\left[\left(\frac1n\sum_{k=1}^nX_k\right)^2\right] =\E\!\left[\frac1n\right]\Var[X]+\E[X]^2\tag{9} $$ Thus, $$ \begin{align} \mathrm{Var}\!\left[\frac1n\sum_{k=1}^nX_k\right] &=\E\left[\left(\frac1n\sum_{k=1}^nX_k\right)^2\right]-\E\!\left[\frac1n\sum_{k=1}^nX_k\right]^2\\[6pt] &=\bbox[5px,border:2px solid #C0A000]{\E\!\left[\frac1n\right]\Var[X]}\tag{10} \end{align} $$ Note that by Cauchy-Schwarz, we have $$ \E\!\left[\frac1n\right]\ge\frac1{\E[n]}\tag{11} $$ Since a Poisson distribution allows $n=0$ and the mean of $0$ items is not well-defined, we will let $n-1$ have a Poisson distribution with mean $N-1$ so that we still have $\mathrm{E}[n]=N$. $$ \begin{align} \E\!\left[\frac1n\right] &=e^{-N+1}\sum_{k=0}^\infty\frac{(N-1)^k}{(k+1)\,k!}\\ &=\frac{e^{-N+1}}{N-1}\sum_{k=0}^\infty\frac{(N-1)^{k+1}}{(k+1)!}\\ &=\frac{1-e^{-N+1}}{N-1}\\[6pt] &\sim\frac1{N-1}\tag{12} \end{align} $$