Determine number of trials in Monte Carlo simulation


Assume we estimate $P(C) = P(A+B)$, where $P(B|\overline{A})$ is given, using Monte Carlo simulation in two different ways:

  1. $P(C)$ is estimated directly as the relative frequency of event $C$ occurrences in $n$ independent experiments
  2. First we estimate the frequency $\frac{m}{n}$ of event $A$ occurrences in $n$ independent experiments, and then compute $P(C)$ as $$P(C) \approx P_n(C)=\frac{m}{n} + \left(1-\frac{m}{n}\right)P(B|\overline{A})$$

If $P(B|\overline{A}) = 0.3$ and $P(A) = 0.4$, how do we determine (for both methods) the minimal number of trials sufficient to make the absolute error of the $P(C)$ estimate less than $0.01$ with probability $\geq 0.95$?

Thank you for any help!

Best answer:

Unfortunately without knowing more about the variance in the probability distribution you are working with, it is not possible to give an explicit answer of how many samples are needed.

We can however say what order of magnitude of samples is required to get that level of precision.

Let $X_j$ be the outcome of one of the independent Monte Carlo runs, and write $Y_j = \textbf{1}(X_j \in C)$ to denote the indicator that $C$ occurred: i.e. $Y_j = 0$ if the event $C$ did not happen, and $1$ if it did.

You can then show that: $$ \mathbb{P}[C] = \mathbb{E}[Y_j] = \mathbb{E}\left[\frac1N \sum_{j=1}^N Y_j \right] $$ which follows from the fact that the $Y_j$ are i.i.d., and basic properties of expectations.

Your question is then: how big does $N$ need to be for $\frac1N \sum_{j=1}^N Y_j$ to be a good approximation of $ \mathbb{E}\left[\frac1N \sum_{j=1}^N Y_j \right]$?
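As a concrete illustration of the indicator-average estimator (a minimal sketch; the helper name and the sample size are my own, and $p = 0.58$ comes from the question's numbers):

```python
import random

def mc_estimate(p, n, seed=0):
    """Monte Carlo estimate of P[C]: the average of n i.i.d.
    indicator draws Y_j, where Y_j = 1 with probability p."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

# With the question's numbers, P[C] = 0.4 + 0.6 * 0.3 = 0.58.
est = mc_estimate(0.58, 100_000)
```

With $N = 100{,}000$ runs the estimate lands well within $\pm 0.01$ of $0.58$, consistent with the sample-size analysis below.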

This can now be recognized as an application of the Central Limit Theorem, which tells us:

$$\sqrt{N} \left( \frac1N \sum_{j=1}^N Y_j - \mathbb{P}[C] \right) \xrightarrow{\;d\;} \mathcal{N}\big( 0, \sigma^2 \big) \quad \text{as } N \rightarrow \infty$$

This formula only holds in the limit $N \rightarrow \infty$. This is an essential assumption.

However, allowing some grace from theoretical statements, and assuming that the approximation is good for sufficiently large $N$, we can rearrange the statement above (using properties of variance) to get the approximation: $$ \left( \frac1N \sum_{j=1}^N Y_j - \mathbb{P}[C] \right) \sim \mathcal{N}\left(0, \frac{\sigma^2}{N} \right) $$

Continuing the "rough" analysis, you would then look to see how large $N$ needs to be for a $\mathcal{N}\left(0, \frac{\sigma^2}{N} \right)$-distributed random variable to lie in $[-0.01, 0.01]$ with probability $0.95$.

Since the general approximation to a 95% confidence interval for a Normal distribution is $\pm 1.96$ standard deviations, we require:

$$ 1.96 \frac{\sigma}{\sqrt N} = 0.01 $$

or equivalently:

$$ N = \big( 196 \sigma \big)^2 $$
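For instance, this rearrangement can be wrapped in a small helper (the function name is my own, not from the original answer); the worst case for an indicator variable is $\sigma = 0.5$:

```python
import math

def samples_needed(sigma, eps=0.01, z=1.96):
    """Smallest N satisfying z * sigma / sqrt(N) <= eps,
    i.e. N >= (z * sigma / eps)**2."""
    return math.ceil((z * sigma / eps) ** 2)

# Worst case for a Bernoulli indicator: sigma <= 0.5.
n_worst = samples_needed(0.5)  # ceil((1.96 * 0.5 / 0.01)**2) = 98**2 = 9604
```

This gives a conservative bound of $9604$ samples regardless of $p$, which is useful when $\sigma$ is unknown.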

So we see in the end that we cannot say exactly how big $N$ must be without knowing what the standard deviation $\sigma$ of the underlying distribution $X$ is.

Also, let me stress again that this is all only true in the limit, but is a pragmatic way that an analyst might use the Central Limit Theorem in real world applications.

Update In the scenario of this specific question, we can make use of one extra piece of information: we already know the exact value of the probability we are approximating. That is, $\mathbb{P}[C] = \mathbb{P}[A] + (1-\mathbb{P}[A])\,\mathbb{P}[B \mid \overline{A}] = 0.4 + 0.6 \cdot 0.3 = 0.58$; for the remainder we denote $p = 0.58$.

To see how we can now use this, we know that each of the $Y_j \sim \text{Ber}(p)$ (Bernoulli), and moreover that $\sum_{j=1}^N Y_j \sim \text{Bin}(N,p)$.

As such, we are in a position to apply the De Moivre - Laplace theorem to conclude:

$$ \sum_{j=1}^N Y_j \sim \mathcal{N}\big(Np, Np(1-p) \big) $$

which holds asymptotically as $N \rightarrow \infty$. Taking some care (by which I mean you will want to check that this is analytically sound), we can divide both sides by $N$ so that we regain the Monte Carlo approximation to $\mathbb{P}[C]$:

$$ \frac{1}{N}\sum_{j=1}^N Y_j \sim \mathcal{N}\big(p, N^{-1}p(1-p) \big) $$

We can now continue as in the original answer above, by taking $1.96$ times the standard deviation to identify the 95% confidence interval. So for this to be $\pm 0.01$ we require:

$$1.96 \sqrt{ \frac{p(1-p)}{N} } = 0.01$$

Rearranging this for $p = 0.58$ I obtain $N \approx 9358.1$; rounding up to guarantee the required precision gives $N = 9359$.
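The arithmetic can be checked directly (a small sketch using only the numbers above):

```python
import math

p, z, eps = 0.58, 1.96, 0.01

# Solving 1.96 * sqrt(p(1-p)/N) = 0.01 for N:
# N = z**2 * p * (1 - p) / eps**2
N = z**2 * p * (1 - p) / eps**2  # (1.96**2) * 0.2436 / 1e-4
```

Here `N` evaluates to about $9358.1$, so $9359$ trials suffice under the normal approximation.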

I am aware that this is slightly lower than the answer the textbook gives (stated in the comments). Hopefully this gives sufficient direction for you to look into the discrepancy.

The second scenario in your question can now be solved analogously, as we are now doing Monte Carlo simulation for the event $A$; you do, however, have to rescale to account for the fact that we combine the estimate with the known probability $\mathbb{P}[B \mid \overline{A}]$. Denoting $q = \mathbb{P}[B \mid \overline{A}] = 0.3$, we have:

$$ \mathbb{P}[C] = q + (1-q)\mathbb{P}[A] $$

All of the variance is contributed by the term $(1-q) \mathbb{P}[A]$, since $q$ is known exactly. Letting $p_A = \mathbb{P}[A] = 0.4$, $Y_j = \mathbf{1}(X_j \in A)$, and applying De Moivre - Laplace as before, we have:

$$(1-q) \frac{1}{N} \sum_{j=1}^N Y_j \sim \mathcal{N}\big((1-q)\,p_A, \; N^{-1}p_A(1-p_A) (1-q)^2 \big) $$

From here the same steps can be followed as in the example for event $C$ to derive an estimate for the number of iterations $N$ required.
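Carrying those steps out numerically (a sketch assuming $p_A = \mathbb{P}[A] = 0.4$ and $q = 0.3$ as in the question; this final number is my own calculation, not stated in the original answer):

```python
import math

q = 0.3    # P(B | not A), known exactly
p_a = 0.4  # P(A), the probability estimated by Monte Carlo
z, eps = 1.96, 0.01

# Solving 1.96 * (1 - q) * sqrt(p_a (1 - p_a) / N) = 0.01 for N:
# N = (z * (1 - q) / eps)**2 * p_a * (1 - p_a)
N = (z * (1 - q) / eps) ** 2 * p_a * (1 - p_a)
```

Here `N` evaluates to about $4517.7$, so roughly $4518$ trials suffice for the second method; the $(1-q)^2$ factor is why it needs fewer trials than the first.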