Although Monte Carlo (MC) is an easy algorithm to implement, I'm still confused about the sampling theory behind it.
In particular, MC can approximate an expectation with
$$ \int f(X(\omega))\mathbb{P}(d\omega) \approx \frac{1}{N}\sum_{i=1}^N f(X(\omega_i)), $$ where $(\Omega, \mathcal{F}, \mathbb{P})$ is the probability space, $X$ is a random variable on it, and the $\omega_i$ are samples drawn from the sample space $\Omega$.
My first intuition is to use quadrature in a deterministic way, e.g., Gauss-Hermite. Although it would be great to compute the integral by evaluating samples, I don't quite understand why summation over samples works here.
My question is, why can we calculate the above integral with that summation using samples? What are the theory and mathematical formulations behind it?
My attempt: Recall the definition of the integral. Suppose that $X$ and $f$ are simple: $X=\sum_{i=1}^K a_i \chi_{A_i}$, where $A_i = X^{-1}\{a_i\}$ and $\chi$ is the indicator function. Then $\int f(X(\omega))\mathbb{P}(d\omega) = \sum_{i=1}^K f(a_i)\mathbb{P}(A_i)\approx \frac{1}{N} \sum_{i=1}^N f(X(\omega_i))$, where each sample $\omega_i$ lies in exactly one of the sets $A_1, \dots, A_K$.
The secret lies in the strong law of large numbers (SLLN).
Consider $\mathbb{E}[X] = \int x p_X(x) dx$.
If we could get many i.i.d. samples $x_1, x_2, ..., x_N$ from the distribution $p_X$, i.e. each $x_i \sim p_X$, then by the strong law of large numbers we have, almost surely as $N \to \infty$, $$\frac{1}{N} \sum_{i=1}^N x_i \to \mathbb{E}[X] = \int x p_X(x) dx$$
So basically, we have $$\frac{1}{N}\sum_{i=1}^N x_i \to \int x p_X(x) dx \qquad \text{(SLLN)}$$
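To see the convergence numerically, here is a minimal sketch using only the Python standard library. It applies the same sample average to $f(x_i)$ instead of $x_i$, which is the sum from the question. The choices $X \sim N(0,1)$ and $f(x) = x^2$ (so that $\mathbb{E}[f(X)] = 1$), the seed, and the helper name `mc_expectation` are all assumptions made for illustration.

```python
import random

def mc_expectation(f, sampler, n):
    """Average f over n i.i.d. draws produced by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

# Illustrative assumption: X ~ N(0, 1) and f(x) = x^2, so E[f(X)] = Var(X) = 1.
rng = random.Random(0)
draw_normal = lambda: rng.gauss(0.0, 1.0)
square = lambda x: x * x

# The running estimate should settle near 1 as n grows.
for n in (100, 10_000, 200_000):
    print(n, mc_expectation(square, draw_normal, n))
```

The error of such an estimate typically shrinks like $1/\sqrt{N}$, provided $f(X)$ has finite variance.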
Now consider $\int f(x) dx = \int \frac{ f(x)}{p_X(x)} p_X(x) dx$, assuming $p_X(x) \neq 0$ over the region of integration.
If we have i.i.d. samples $x_1, ..., x_N$ from $p_X$, then applying $(SLLN)$ to the random variable $f(X)/p_X(X)$ in place of $X$ gives, as $N$ becomes large,
$$\frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{p_X(x_i)} \to \int \frac{ f(x)}{p_X(x)} p_X(x) dx = \int f(x) dx$$
So we are estimating the integral $\int f(x) dx$ with $\frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{p_X(x_i)}$.
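As a rough illustration of this estimator, the sketch below approximates $\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$ by sampling from a standard normal and averaging $f(x_i)/p_X(x_i)$. The integrand, the choice of $p_X$, the seed, and the helper names `normal_pdf` and `mc_integral` are assumptions picked for the example, not something fixed by the argument above.

```python
import math
import random

def normal_pdf(x):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mc_integral(f, pdf, sampler, n):
    """Estimate the integral of f by averaging f(x)/pdf(x) over draws x ~ pdf."""
    total = 0.0
    for _ in range(n):
        x = sampler()
        total += f(x) / pdf(x)
    return total / n

# Illustrative assumption: f(x) = exp(-x^2), whose integral over R is sqrt(pi).
rng = random.Random(1)
integrand = lambda x: math.exp(-x * x)
estimate = mc_integral(integrand, normal_pdf, lambda: rng.gauss(0.0, 1.0), 100_000)

print("estimate:", estimate)
print("exact   :", math.sqrt(math.pi))
```

Here the ratio $f/p_X$ is bounded, so the estimator has finite variance; a sampling density with lighter tails than $f$ would not have that guarantee.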
The choice of $p_X$ is arbitrary, as long as it is nonzero wherever $f$ is. However, the variance of the estimate depends crucially on the choice of the sampling distribution. Related concepts are importance sampling and variance reduction.
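The effect of the sampling distribution on the variance can be checked with a small experiment. The sketch below estimates $\int_0^1 3x^2 dx = 1$ twice: once with the uniform density $p(x) = 1$ and once with $p(x) = 2x$ on $(0, 1]$, sampled as $\sqrt{U}$ for uniform $U$. The integrand, both densities, the seed, and the helper name `weighted_terms` are assumptions chosen for illustration. For this example the per-sample variances work out analytically to $0.8$ and $0.125$ respectively.

```python
import math
import random
import statistics

def weighted_terms(f, pdf, sampler, n):
    """Return the n terms f(x_i)/pdf(x_i) whose mean is the MC estimate."""
    return [f(x) / pdf(x) for x in (sampler() for _ in range(n))]

rng = random.Random(2)
f = lambda x: 3.0 * x * x          # integral over [0, 1] equals 1
n = 20_000

# Sampler A: uniform on (0, 1], density 1. (1 - random() avoids x == 0.)
uniform_terms = weighted_terms(f, lambda x: 1.0, lambda: 1.0 - rng.random(), n)

# Sampler B: density 2x on (0, 1], drawn by inverse transform as sqrt(U).
linear_terms = weighted_terms(
    f, lambda x: 2.0 * x, lambda: math.sqrt(1.0 - rng.random()), n
)

for name, terms in (("uniform", uniform_terms), ("p(x)=2x", linear_terms)):
    print(name, "mean:", statistics.fmean(terms),
          "variance:", statistics.variance(terms))
```

Both averages target the same integral, but the second density is closer in shape to $f$, so its terms $f(x_i)/p(x_i)$ fluctuate less. That is the basic idea behind importance sampling.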