Simple Monte Carlo uncertainty quantification: why are more samples required with additional uncertain inputs?


Having read a little about simple Monte Carlo methods, I understand that, given a random variable $X$, the difference between its true mean $\mu$ and its sample mean $\hat{\mu}_n$ is of order $\sigma / \sqrt{n}$ where $\sigma$ is the standard deviation of $X$ and $n$ is the number of samples.
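For concreteness, this $\sigma/\sqrt{n}$ rate is easy to check numerically. A minimal R sketch (the standard exponential distribution here is just an arbitrary choice with known mean $1$ and standard deviation $1$):

```r
# Sketch: compare the sample-mean error against sigma/sqrt(n) for a
# standard exponential (true mean 1, true sd 1) at growing sample sizes.
set.seed(42)
for (n in c(1e2, 1e4, 1e6)) {
  err <- abs(mean(rexp(n)) - 1)
  cat(sprintf("n = %7.0f  |error| = %.5f  sigma/sqrt(n) = %.5f\n",
              n, err, 1 / sqrt(n)))
}
```

The observed error fluctuates from seed to seed, but it shrinks at roughly the $1/\sqrt{n}$ rate as $n$ grows.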

Simple Monte Carlo methods are often used to account for input uncertainties in flood inundation models, where the inputs include initial conditions, boundary conditions and model parameters. In this context, it is often argued (e.g. Neal et al. 2013) that the number of samples becomes prohibitively large as more classes of input uncertainty are introduced.

Since the rate of convergence is independent of dimensionality (where the dimensionality is the number of uncertain inputs), why is this argument valid? Is it because the standard deviation of the model output grows with the number of dimensions? Or is it that the "combined standard deviation" of all the uncertain inputs somehow grows?

I am particularly interested in an intuitive explanation as well as a mathematically rigorous one.

Best answer

The problem is that in high dimensions, the driver of the population mean may be concentrated in what turns out to be an extremely small part of the sample space, and your Monte Carlo samples may be unlikely to land in that part.

As a relatively simple example, suppose that you are sampling uniformly from the $d$-dimensional hypercube $[0,2]^d$ of side $2$, and you are interested in the product of the $d$ coordinates. It should be fairly obvious that the product can be anything from $0$ to $2^d$, and that its expectation is $1$. Its variance is $\sigma^2=\left(\frac43\right)^d-1$, which turns out to be very large for large $d$.
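For completeness, the variance follows from a one-line moment calculation. Since the coordinates $X_i$ are independent and uniform on $[0,2]$,

$$E[X_i]=1,\qquad E[X_i^2]=\int_0^2 \frac{x^2}{2}\,dx=\frac43,$$

so for the product $P=\prod_{i=1}^d X_i$,

$$E[P]=\prod_{i=1}^d E[X_i]=1,\qquad E[P^2]=\prod_{i=1}^d E[X_i^2]=\left(\frac43\right)^d,\qquad \sigma^2=E[P^2]-E[P]^2=\left(\frac43\right)^d-1.$$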

Now let's try to simulate this with $100000$ samples, using the following R code

# Draw `cases` points uniformly from [minval, maxval]^dimensions and
# return the product of the coordinates for each point
sampleproducts <- function(dimensions, cases=10^5, minval=0, maxval=2){
  coords <- matrix(runif(cases * dimensions, minval, maxval), ncol=dimensions)
  # compute each row's product as exp(sum(log(coords))), which copes
  # better with the extreme magnitudes that arise in high dimensions
  products <- exp(rowSums(log(coords)))
  return(products)
}

and with $10$ dimensions we might, for example, get

set.seed(1)
products10d <- sampleproducts(10)
summary(products10d)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#  0.00000   0.00678   0.06459   1.00051   0.45429 217.90770 

where the sample mean is close to the population mean of $1$, even though the mean is more than $15$ times the median. So there is no real problem here.

Now try $1000$ dimensions:

set.seed(2019)
products1000d <- sampleproducts(1000)
summary(products1000d)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
# 0.000e+00 0.000e+00 0.000e+00 5.300e-83 0.000e+00 5.293e-78 

Disaster! Most of the sample products are so small that they do not even register in the summary table. Even the largest here is below $10^{-77}$, so the sample mean is necessarily smaller than that. The sample mean is proportionately a very long way from the true expectation of $1$, because so very little of the sample space produces substantial values.

This does not mean that your standard-deviation argument is wrong, just that it is not particularly useful. Here the standard error of the sample mean, $\frac{\sigma}{\sqrt{n}}$ with a sample size of $n=100000$, is about $0.013$ when $d=10$ but almost $10^{60}$ when $d=1000$. The latter figure is incredibly large considering that all the numbers involved are positive and the expected value you are trying to estimate is only $1$.
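Those two standard errors can be reproduced directly from the variance formula (a quick sketch; `se` is just an illustrative helper name):

```r
# Standard error sigma/sqrt(n) for the d-dimensional product example,
# using the variance formula sigma^2 = (4/3)^d - 1 from above
se <- function(d, n = 10^5) sqrt(((4/3)^d - 1) / n)
se(10)    # about 0.013
se(1000)  # about 9.3e59, i.e. nearly 10^60
```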

In reality, you do not know where your Monte Carlo model needs to sample to obtain useful values, and a complicated underlying population with a large number of dimensions makes this more difficult. Nor are you likely to know the population variance: if you did, then presumably you would know other things too and might not need Monte Carlo methods at all.