I started reading about the problem of sampling from a very complicated distribution. The article states the problem as follows (not a verbatim citation):

You have a finite (potentially very large) set of $N$ elements, and for each of them you know its probability. You need to find an algorithm that quickly draws an element according to this distribution.
Most of the time, before reading further, I try to come up with my own solutions, and only then do I read the suggested ones. My first solution was kind of obvious:
- Uniformly select one of the $N$ elements; call it $x$. Having selected $x$, you know its probability $p_x$. Now throw a coin biased with $p_x$, and on success you select $x$; on failure you redo everything from the start. It is clear that for a large $N$ I will spend a huge amount of time redrawing, so this algorithm is inefficient.
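A minimal sketch of this first scheme in Python (the function name is my own). Since the probabilities sum to 1, each trial accepts with overall probability $1/N$, so the expected number of trials per draw is $N$:

```python
import random

def rejection_sample(probs):
    """Draw an index according to probs by uniform proposal + biased coin.

    Returns (index, number_of_trials) so the cost per draw is visible.
    """
    n = len(probs)
    trials = 0
    while True:
        trials += 1
        i = random.randrange(n)          # uniform proposal over the N elements
        if random.random() < probs[i]:   # biased coin with success probability p_i
            return i, trials             # accept; otherwise redo everything
```

Averaged over many calls with $N = 4$, the trial count comes out close to 4, matching the $O(N)$ expected cost described above.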
My second approach is harder to describe, so I will use an example. Let $N = 4$ with probabilities $[0.3, 0.05, 0.25, 0.4]$. I construct the cumulative sums $[0.3, 0.35, 0.6, 1]$. Then I find the smallest common increment of the probabilities (not sure what the right term is, but something similar to an LCM for fractions); here it is $0.05$. Now I uniformly draw an integer from 1 to 20 ($= 1/0.05$), multiply it by the common increment, and check which region of the cumulative-sum array it falls into.
This approach looks more feasible, and the only problem I see right now is when the smallest common increment is very small, which makes the range of the uniform draw too large.
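A sketch of this second scheme in Python (function name is my own; I use `fractions.Fraction` to recover the common increment exactly, which is one way to make "LCM for fractions" precise):

```python
import random
from bisect import bisect_left
from fractions import Fraction
from functools import reduce
from math import gcd

def draw_common_increment(probs):
    """Draw an index by scaling all probabilities to a common integer grid."""
    # Recover each probability as an exact fraction.
    fracs = [Fraction(p).limit_denominator(10**6) for p in probs]
    # The grid size is the LCM of the denominators; its reciprocal is
    # the "smallest common increment" (20 for the example, i.e. 0.05).
    denom = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in fracs))
    weights = [int(f * denom) for f in fracs]   # integer weights summing to denom
    cum, s = [], 0
    for w in weights:                           # integer cumulative sums
        s += w
        cum.append(s)
    k = random.randint(1, denom)                # uniform integer on the grid
    return bisect_left(cum, k)                  # region of the cumulative array
```

For the example this gives integer weights $[6, 1, 5, 8]$ on a grid of 20, and each draw lands in the correct region with the right frequency. The drawback mentioned above shows up in `denom`: probabilities like $0.123457$ force an enormous grid.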
I quickly looked at the suggested solution: it is the Metropolis–Hastings algorithm, which has nothing to do with what I proposed. That suggests there are some big problems with my approach.
So can anyone show me why my approach is inefficient (or maybe totally wrong)?
The first method you propose is the simplest form of rejection sampling, using a uniform proposal distribution. It is problematic whenever there are many outcomes with low probability; e.g., $$\Pr[N = x] = \begin{cases} 10^{-6}, & x \in \{1, 2, \ldots, 10^5\} \\ 0.9, & x = 0. \end{cases}$$ In such a scenario you would take a very long time to obtain one realization: the uniform proposal rarely picks $N = 0$, and the rest of the time the chance of accepting any $N > 0$ is only $10^{-6}$. You have already correctly identified the flaw in this approach.
The second method you propose is a form of inverse transform sampling, modified for discrete outcomes. This is certainly better than the first method, but your implementation is problematic when the spacing between cumulative probabilities does not admit a sufficiently large common increment, again as you correctly identified.
If you simply performed true inverse transform sampling, this would be better: refer to the Wikipedia article on the topic.
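A minimal sketch of discrete inverse transform sampling (function name is my own): precompute the cumulative sums once, draw a single uniform $U \in [0, 1)$, and binary-search for the first cumulative sum that reaches $U$. This avoids the common-increment grid entirely, costing $O(\log N)$ per draw after $O(N)$ setup:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def inverse_transform_sample(probs):
    """Draw an index i with probability probs[i] via the inverse CDF."""
    cum = list(accumulate(probs))             # cumulative distribution function
    u = random.random()                       # U ~ Uniform[0, 1)
    i = bisect_left(cum, u)                   # smallest i with cum[i] >= u
    return min(i, len(probs) - 1)             # guard against float round-off in cum[-1]
```

In practice one would build `cum` once and reuse it across draws; it is kept inside the function here only to make the sketch self-contained.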
Regarding Metropolis–Hastings, the benefit of that method is the ability to sample from very high-dimensional distributions in a way that sacrifices independence (i.e., realizations are autocorrelated) but gains a higher acceptance probability and therefore computational efficiency. You generally don't have this problem in a one-dimensional discrete case, even if the number of outcomes is large.