Monte Carlo estimation when the function value can only be estimated indirectly


I have the following setup.

I want to Monte Carlo estimate a big sum $$F=\sum_x p(x) f(x)$$ by drawing $\{ x_1,\dots, x_M \}$ from the distribution $p(x)$ and averaging $f(x_i)$. However, in my case $$f(x) = G( \alpha(x))$$ where $\alpha(x) \in [-1,1]$ and $G$ is a known, bounded, non-linear function with $0 \leq G(a)\leq 1$ for $a\in [-1,1]$.

The issue is that I cannot compute $\alpha(x)$ exactly for a given $x$. Instead, I can only estimate it from a finite number of samples. More precisely, for a given $x$ and its corresponding $\alpha(x)$, I draw $K$ Bernoulli-type random variables $A^{(x)}_1,\dots, A^{(x)}_K$ with $A^{(x)}_i \in \{ -1, +1 \}$ from a distribution such that $\mathbb{E} A^{(x)}_i = \alpha(x)$, and estimate $\alpha(x)$ via

$$\hat{\alpha}(x) = \frac{1}{K}\sum_{j=1}^K A^{(x)}_j$$

Hence, my final estimator is given by

\begin{align} \hat{F} &= \frac{1}{M} \sum_{i=1}^M G \left( \hat{\alpha}(x_i) \right) \\ &= \frac{1}{M} \sum_{i=1}^M G \left(\frac{1}{K}\sum_{j=1}^K A^{(x_i)}_j \right) \end{align}
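
For concreteness, here is a minimal Python sketch of this nested estimator; `sample_x`, `alpha`, and `G` are hypothetical placeholders, and since in the real problem $\alpha(x)$ is not computable, the $\pm 1$ draws are simulated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_F(sample_x, alpha, G, M, K):
    """Nested Monte Carlo estimate of F = E_p[ G(alpha(x)) ].

    sample_x() draws one x ~ p(x); alpha(x) lies in [-1, 1]; G maps
    [-1, 1] into [0, 1].  In the real problem alpha(x) is unknown and
    the +/-1 draws come from an oracle; here they are simulated.
    """
    total = 0.0
    for _ in range(M):
        x = sample_x()
        a = alpha(x)
        # K samples A_j in {-1, +1} with mean alpha(x),
        # i.e. P(A_j = +1) = (1 + alpha(x)) / 2.
        A = rng.choice([-1.0, 1.0], size=K, p=[(1 - a) / 2, (1 + a) / 2])
        total += G(A.mean())  # G(alpha_hat(x_i))
    return total / M
```

For instance, `estimate_F(lambda: rng.uniform(-1, 1), lambda x: x, lambda a: a**2, M=2000, K=50)` estimates $F = \mathbb{E}[x^2] = 1/3$ for $p$ uniform on $[-1,1]$, and already exhibits the bias discussed in the answers below.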

My question is:

  1. Is $\hat{F}$ even an unbiased estimator of $F$? Given that $G$ is non-linear, it doesn't seem like it.
  2. Say I wanted to show that with high probability $|\hat{F} - F| <\epsilon$ when drawing enough samples, i.e. choosing $K$ and $M$ large enough. How would you approach this? Can we control the variance of $\hat{F}$?

There are 2 answers below.

---

You're right that, in general, $\hat{F}$ is not an unbiased estimator of $F$ because of the non-linearity of $G$. Here's an example.

Suppose $\alpha(x) = 0$ and $G(a) = a^2$. (Note that this means your Bernoulli distributions have a $50\%$ chance of $-1$ and a $50\%$ chance of $+1$.) Then $F = 0$. But $G(\hat{\alpha}(x)) = \hat{\alpha}(x)^2$ is

  • $0$ if your Bernoulli samples balance out exactly (i.e. half are $-1$ and half are $1$),
  • positive otherwise.

This means that $\hat{F}$ has positive expectation and is therefore biased. In fact, here $\mathbb{E}[\hat{\alpha}(x)^2] = \operatorname{Var}(\hat{\alpha}(x)) = 1/K$, so the bias is exactly $1/K$.
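
A quick simulation sketch (with illustrative parameters, not part of the original answer) makes this visible: with $\alpha(x)=0$ and $G(a)=a^2$, the average of $G(\hat{\alpha}(x))$ concentrates near $1/K$ rather than near $F=0$:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 20, 200_000

# alpha = 0: each A_j is uniform on {-1, +1}, so F = G(0) = 0.
A = rng.choice([-1.0, 1.0], size=(trials, K))
alpha_hat = A.mean(axis=1)

print(np.mean(alpha_hat**2))  # ~0.05 = 1/K, not 0: the estimator is biased
```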


For the second part of your question I don't have a complete answer, but it's important to note that $\hat{F}$ being close to $F$ is not the same as the variance of $\hat{F}$ being small, precisely because of the bias. For $\hat{F}$ to approximate $F$ well, you need both its variance and its bias to be small: the mean squared error decomposes as $\mathbb{E}[(\hat{F}-F)^2] = \operatorname{Var}(\hat{F}) + \left(\mathbb{E}[\hat{F}]-F\right)^2$. This is the bias-variance decomposition.

Assuming you have many samples of $\hat{F}$, you can estimate its variance the usual way (i.e. just check how spread out they are). That leaves the problem of understanding how big the bias is. This requires some assumptions on $G$, but for a reasonably nice $G$ the bias shrinks to zero as $K\to\infty$. Assuming $G$ is smooth and $K$ is large enough that $\hat{\alpha}$ is tightly concentrated near $\alpha$, the bias is approximately proportional to the second derivative of $G$ at $\alpha$, as the expansion below makes precise. So whether $G$ is very non-linear or only mildly non-linear dictates whether the bias is large.
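
Concretely, a heuristic second-order Taylor expansion around $\alpha$ (assuming $G$ is twice differentiable; this is an approximation, not a rigorous bound) gives

\begin{align} \mathbb{E}\left[G(\hat{\alpha})\right] &\approx \mathbb{E}\left[G(\alpha) + G'(\alpha)\,(\hat{\alpha}-\alpha) + \tfrac{1}{2}G''(\alpha)\,(\hat{\alpha}-\alpha)^2\right] \\ &= G(\alpha) + \tfrac{1}{2}G''(\alpha)\operatorname{Var}(\hat{\alpha}) = G(\alpha) + \frac{G''(\alpha)\,(1-\alpha^2)}{2K}, \end{align}

since $\mathbb{E}[\hat{\alpha}-\alpha]=0$ and $\operatorname{Var}(\hat{\alpha}) = (1-\alpha^2)/K$ for $\pm 1$-valued samples with mean $\alpha$.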

To gain intuition about why the second derivative of $G$ dictates the bias, you can work out the bias of $\hat{F}$ exactly for the example above (the math works out immediately, because $G$ equals its second-order Taylor approximation). Then try replacing $G$ with a general quadratic polynomial and repeat the exercise, as in the sketch below.
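
Here is a small simulation sketch of that quadratic exercise (the coefficients and parameters are arbitrary illustrative choices, not from the original answer): for $G(a)=c_0+c_1 a+c_2 a^2$, the bias is exactly $c_2\operatorname{Var}(\hat{\alpha}) = c_2(1-\alpha^2)/K$, since the linear terms are unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
c0, c1, c2 = 0.3, 0.1, 0.5          # an arbitrary quadratic G
alpha, K, trials = 0.4, 25, 500_000

G = lambda a: c0 + c1 * a + c2 * a**2

# +/-1 samples with mean alpha: P(A = +1) = (1 + alpha) / 2.
A = rng.choice([-1.0, 1.0], size=(trials, K),
               p=[(1 - alpha) / 2, (1 + alpha) / 2])
alpha_hat = A.mean(axis=1)

empirical_bias = np.mean(G(alpha_hat)) - G(alpha)
predicted_bias = c2 * (1 - alpha**2) / K
print(empirical_bias, predicted_bias)  # the two should agree closely
```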

---

For the second question, you can use Hoeffding's inequality twice. Since each $A^{(x)}_j$ lies in $[-1,1]$, Hoeffding's inequality gives, using $K=\frac{2}{\epsilon^2}\log \frac{2}{\delta}$ samples,

\begin{equation} \mathbb{P}\left( |\hat{\alpha}(x)-\alpha(x)|\geq \epsilon\right)\leq \delta \end{equation}

Next, assume that the function $G$ is Lipschitz continuous with Lipschitz constant $L$ (if it is not, you can try weaker notions of continuity, for instance Hölder continuity). Then, for any set of $M$ points $\{ x_i \}_{i=1}^M$,

\begin{equation} \left|\frac{1}{M}\sum_{i=1}^M G(\alpha(x_i))- \frac{1}{M}\sum_{i=1}^M G(\hat{\alpha}(x_i))\right|\leq L \epsilon \end{equation}

with probability at least $1-M\delta$; this follows from the union bound over the $M$ points. Similarly, applying Hoeffding's inequality to the bounded values $G(\alpha(x_i)) \in [0,1]$,

\begin{equation} \mathbb{P}\left( \left|\frac{1}{M}\sum_{i=1}^M G(\alpha(x_i))-\mathbb{E}[G(\alpha(x))]\right|\geq \epsilon_2\right)\leq \delta_2 \end{equation}

where $M=\frac{1}{2\epsilon_2^2} \log \frac{2}{\delta_2}$. Finally, combining both results,

\begin{equation} \left|\frac{1}{M}\sum_{i=1}^M G(\hat{\alpha}(x_i))-\mathbb{E}[G(\alpha(x))]\right|\leq \epsilon_2+L \epsilon \end{equation}

with probability at least $1-M\delta -\delta_2$, by another union bound. For example, you can take $\delta_2=\delta_0/2$, $\delta=\delta_0/2M$, $\epsilon_2=\epsilon_0/2$, and $\epsilon=\epsilon_0/(2L)$, yielding

\begin{equation} \left|\frac{1}{M}\sum_{i=1}^M G(\hat{\alpha}(x_i))-\mathbb{E}[G(\alpha(x))]\right|\leq \epsilon_0 \end{equation}

with probability at least $1-\delta_0$, where $M=\frac{2}{\epsilon_0^2}\log \frac{4}{\delta_0}$ and $K=\frac{8L^2}{\epsilon_0^2} \log \frac{4M}{\delta_0}$.
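
These final formulas are easy to wrap in a small helper; here is a sketch (the function name and example values are illustrative, and `L` is the Lipschitz constant of $G$):

```python
import math

def sample_sizes(eps0, delta0, L):
    """Return (M, K) so that |F_hat - F| <= eps0 with probability
    at least 1 - delta0, per the Hoeffding/union-bound argument above
    (G assumed L-Lipschitz)."""
    M = math.ceil(2 / eps0**2 * math.log(4 / delta0))
    K = math.ceil(8 * L**2 / eps0**2 * math.log(4 * M / delta0))
    return M, K

print(sample_sizes(eps0=0.05, delta0=0.01, L=1.0))
```

Note that the total number of $\pm 1$ draws is $MK$, which scales like $\tilde{O}(L^2/\epsilon_0^4)$ up to logarithmic factors.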