Naive Monte Carlo Sampling vs. Importance Sampling


Can someone help me understand this paragraph:

> The naive Monte Carlo estimator introduced in the last section performs well if the prior and posterior distribution have a similar shape and strong overlap. However, the estimator is unstable if the posterior distribution is peaked relative to the prior. In such a situation, most of the sampled values for θ result in likelihood values close to zero and contribute only minimally to the estimate. This means that those few samples that result in high likelihood values dominate estimates of the marginal likelihood. Consequently, the variance of the estimator is increased.

Why, if the posterior distribution is peaked relative to the prior, do most of the sampled values for $\theta$ result in likelihood values close to zero and contribute only minimally to the estimate? And why is the variance of the estimator increased?

$$ \hat{p}_{1}(y)=\underbrace{\frac{1}{N} \sum_{i=1}^{N} p\left(y \mid \tilde{\theta}_{i}\right)}_{\text {average likelihood }}, \quad \underbrace{\tilde{\theta}_{i} \sim p(\theta)}_{\begin{array}{c} \text { samples from the } \\ \text { prior distribution } \end{array}} . $$

The point I don't understand is why the estimate of the marginal likelihood should depend on the posterior at all, when

$$ \underbrace{p(y \mid \mathcal{M})}_{\begin{array}{c} \text { marginal } \\ \text { likelihood } \end{array}}=\int \underbrace{p(y \mid \theta, \mathcal{M})}_{\text {likelihood }} \underbrace{p(\theta \mid \mathcal{M})}_{\text {prior }} \mathrm{d} \theta, $$

which means the marginal likelihood depends only on the likelihood and the prior.


To trace out a density with the Monte Carlo method, we have to know (roughly) where the probability mass is. We know where the prior puts its mass, but we don't know in advance where the posterior does. If the posterior is very peaked, most of the draws from the prior may end up being almost meaningless.
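This can be made precise with the variance of the estimator (a standard decomposition, using the question's notation):

$$ \operatorname{Var}\left[\hat{p}_{1}(y)\right]=\frac{1}{N} \operatorname{Var}_{p(\theta)}\left[p(y \mid \theta)\right]=\frac{1}{N}\left(\int p(y \mid \theta)^{2}\, p(\theta)\, \mathrm{d} \theta-p(y)^{2}\right). $$

When the posterior is peaked relative to the prior, $p(y \mid \theta)$ is close to zero for most $\theta$ drawn from the prior and very large on a small region, so the second moment $\int p(y \mid \theta)^{2}\, p(\theta)\, \mathrm{d}\theta$ greatly exceeds $p(y)^{2}$ and the variance is inflated.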

For e.g., suppose, the prior is diffuse/flat from -10 to 10. Suppose the posterior is $\mathcal{N}(-3, 0.1)$. To trace out the posterior, it would make sense to take uniform draws from -10 to 10. However, only a small sliver of those draws (0.6/20) will fall within the interval -2.7 to -3.3 (i.e., 3 standard deviations of the posterior). Its easy to see that net of net, we are doing a lot of work to establish that the posterior is almost 0 from -10 to -4 and from -2 to 10 but that most of the information actually comes from a small percentage of the draws. Of course if the number of draws goes to $\infty$ then 0.6/20 * $\infty$ is $\infty$ so the Monte Carlo method is fine but in practice where we have finite time and resources, we often seek a (computationally) better method.