What is wrong with my idea of sampling from a complicated distribution


I started to read about the problem of sampling from a very complicated distribution. The article states the problem in the following way (paraphrased, not a direct quotation):

You have a finite (potentially very large) number $N$ of elements, and for each of them you know its probability. You need to find an algorithm that quickly draws an element according to this distribution.

Most of the time, before reading something, I try to come up with my own solutions, and only after that do I read the suggested ones. My first solution was kind of obvious:

  • Uniformly select one of the $N$ elements. Once you have selected an element, you know its probability $p$. Now flip a biased coin that succeeds with probability $p$: on success you select that element; on failure you redo everything from the start. It is clear that for a large $N$ I will spend a huge amount of time redrawing, so this algorithm is inefficient.
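For concreteness, the scheme above can be sketched in a few lines of Python (the function name and the representation of the distribution as a list are my own choices):

```python
import random

def rejection_sample(probs):
    """Rejection sampling with a uniform proposal:
    pick an index uniformly at random, then accept it with
    probability probs[i]; on rejection, start over."""
    n = len(probs)
    while True:
        i = random.randrange(n)         # uniform proposal
        if random.random() < probs[i]:  # biased coin flip
            return i                    # accept
        # reject: loop and redraw everything
```

Each pass through the loop costs $O(1)$; the inefficiency is entirely in how many passes are needed before an acceptance.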

My second approach is harder to describe, so I will use an example. Let $N=4$ and the probabilities be $[0.3, 0.05, 0.25, 0.4]$. My approach is to construct the cumulative sums $[0.3, 0.35, 0.6, 1]$. Now find the smallest divisible factor (I am not sure of the right term, but something similar to an LCM for fractions) of the probabilities; here it is $0.05$. Now uniformly draw an integer from $1$ to $20$ ($= 1/0.05$), multiply it by the smallest divisible factor, and check which region of the cumulative-sum array it falls into.
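Here is a sketch of this idea in Python, assuming the probabilities are exact decimals so that a common grid exists (`fractions.Fraction` and `math.lcm` play the role of the "LCM for fractions"; the names are mine):

```python
import random
from fractions import Fraction
from math import lcm

def grid_sample(probs):
    """Sample via the 'smallest divisible factor' idea: put all
    probabilities on a common integer grid, draw a uniform grid
    point, and locate it in the cumulative-sum array."""
    # Exact fractions, e.g. 0.05 -> 1/20 (limit_denominator undoes
    # binary floating-point noise for decimal inputs).
    fracs = [Fraction(p).limit_denominator(10**6) for p in probs]
    m = lcm(*(f.denominator for f in fracs))  # grid size: 20 for step 0.05
    # Cumulative sums in grid units: [6, 7, 12, 20] for the example.
    cum, total = [], 0
    for f in fracs:
        total += int(f * m)
        cum.append(total)
    x = random.randrange(1, m + 1)            # uniform integer in 1..m
    for k, c in enumerate(cum):
        if x <= c:
            return k
```

Note that `m` here is exactly the "too big region" worry: for probabilities like $1/3$ or $0.123457$, the common grid explodes or does not exist at all.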

This approach looks more feasible, and the only problem I see right now is when the smallest divisible factor is too small, which results in too large a range to draw a uniform number from.


I quickly looked at the suggested solution, and it is the Metropolis-Hastings algorithm, which has nothing to do with what I proposed. This suggests there are some big problems with my approach.

So can anyone show me why my approach is inefficient (or maybe totally wrong)?

2 Answers

BEST ANSWER

The first method you propose is the simplest form of rejection sampling, using a uniform proposal distribution. It will be problematic if there are many outcomes with low probability; e.g., $$\Pr[N = x] = \begin{cases} 10^{-6}, & x \in \{1, 2, \ldots, 10^5\} \\ 0.9, & x = 0. \end{cases}$$ In such a scenario, you would take a very long time to obtain one realization: the proposal rarely picks $N = 0$ (the only outcome with appreciable probability), and the rest of the time the chance of accepting a proposed $N > 0$ is only $10^{-6}$. You have already correctly identified the flaw in this approach.
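To put a number on it (my calculation, for the distribution above): with $n = 10^5 + 1$ possible outcomes, each round proposes an outcome uniformly and accepts it with its probability, so the per-round acceptance probability is $$\frac{1}{n}\sum_x \Pr[N = x] = \frac{1}{n},$$ since the probabilities sum to $1$. The number of rounds until acceptance is geometric with mean $n$, i.e. roughly $10^5$ rounds per sample.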

The second method you propose is a form of inverse transform sampling, modified for discrete outcomes. This is certainly better than the first method, but your implementation is problematic when the spacing between cumulative probabilities does not admit a sufficiently large common increment, again as you correctly identified.

If you simply performed true inverse transform sampling, this would be better: refer to the Wikipedia article on the topic.
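As a sketch (in Python, with the distribution given as a list of probabilities), discrete inverse transform sampling amounts to:

```python
import random

def inverse_transform_sample(probs):
    """Discrete inverse transform sampling: draw U ~ Uniform[0, 1)
    and return the first index whose cumulative probability
    exceeds U.  No common grid is needed, so arbitrary
    probabilities pose no problem."""
    u = random.random()
    total = 0.0
    for k, p in enumerate(probs):
        total += p
        if u < total:
            return k
    return len(probs) - 1  # guard against floating-point round-off
```

This removes the "smallest divisible factor" requirement entirely, since $U$ is drawn from the continuous interval rather than a finite grid.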

Regarding Metropolis-Hastings, the benefit of this method has to do with the ability to sample from distributions with very high dimensions in a way that sacrifices independence (i.e., realizations are autocorrelated) but gains increased probability of acceptance and therefore computational efficiency. You generally don't really have this problem in a discrete case with one dimension, even if the number of outcomes is large.

ANSWER

Once you have computed the cumulative sums of the probabilities $\sum_{i=1}^k p_i$, simply choose $X$ uniformly over $[0,1]$, then use binary search to find the two cumulative sums that $X$ lies between. If $X$ lies between $\sum_{i=1}^{k-1} p_i$ and $\sum_{i=1}^k p_i$, then choose element number $k$.

The computational cost has two main parts: (1) the cost of constructing a list of cumulative sums on which you can do the binary search, and (2) the binary search. Part (2) of the cost is incurred every time you draw a random element from your set of $N$ elements, but it costs only $O(\log N)$ time for each element drawn (because of the binary search). Part (1) only needs to be done once for a given distribution, and then you can draw random elements as many times as you like using the same list of sums.
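The two-part cost described above might look like this in Python (`bisect` does the binary search; `make_sampler` and `draw` are names I chose):

```python
import random
from bisect import bisect_right
from itertools import accumulate

def make_sampler(probs):
    """Part (1), done once: precompute the cumulative sums.
    Part (2), per draw: a single binary search, O(log N)."""
    cum = list(accumulate(probs))  # e.g. [0.3, 0.35, 0.6, 1.0]
    def draw():
        x = random.random()
        # First index k with cum[k] > x; clamp in case floating-point
        # round-off leaves cum[-1] slightly below 1.0.
        return min(bisect_right(cum, x), len(cum) - 1)
    return draw
```

A usage sketch: `draw = make_sampler(probs)` pays the $O(N)$ setup once, after which every `draw()` call is $O(\log N)$.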


But the method above assumes that it is practical to construct a list of the cumulative probabilities. Looking at the first two paragraphs after "The Problem is Drawing from a Distribution" in https://jeremykun.com/2015/04/06/markov-chain-monte-carlo-without-all-the-bullshit/, the space from which we want to sample is the space of all possible baby names. To help us do this, we are given a "magic box" (much like one of the "oracles" favored by theoretical computer science) that takes a string as input and outputs the probability that this string would be chosen as a baby name.

How large is the sample space? Without a lot of information about the workings of the "magic box" (and the "magic" implies we know very little of that), we cannot rule out any seemingly random concatenation of syllables such as "Dweezil", nor long names such as "Rumplestiltskin", nor concatenations of other names such as "Philomenakatharinedelilah" -- in other words, the state space is huge. Based on a back-of-the-envelope calculation, "larger than the existing storage capacity of the human race" seems a low estimate.

If we assume we have at least a little a priori knowledge about the responses the "magic box" will give, for example that "Emily" will be much more likely than "Jenniferabellannamariedummy", we can set up a partial table of cumulative probabilities using all the names in a large database of names actually given, which will cover all but the extremely rare names. Moreover, if we have a deterministic algorithm to generate strings of letters (other than these names) so that more plausible "names" are produced earlier in the sequence of outputs, then heuristically we might hope that, as we feed these names to the "magic box", the probabilities will accumulate much faster than we would achieve by a more "uniform" method of choosing random strings as names.

The idea here is that the stored data set, while initially fairly large, is well within the storage capacity of commonly-available computing platforms, and that even if we choose to extend our data set as we sample new names from our "name-generating" algorithm, the data set will still grow much more slowly than the number of times we sample the probability space. (There are ways to make it grow very slowly by storing only a subset of the cumulative probabilities we calculate.)
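A hedged sketch of this partial-table idea (the oracle, the candidate generator, and the coverage threshold are all hypothetical stand-ins for the "magic box" setup described above):

```python
import random
from bisect import bisect_right

def build_partial_table(oracle, candidates, coverage=0.99):
    """Feed candidate names to the probability oracle until the
    accumulated probability mass reaches `coverage`, storing the
    cumulative sums for later inverse-transform sampling."""
    names, cum, total = [], [], 0.0
    for name in candidates:
        p = oracle(name)
        if p > 0.0:
            total += p
            names.append(name)
            cum.append(total)
        if total >= coverage:
            break
    return names, cum

def sample_name(names, cum):
    """Inverse-transform sample from the partial table; returns
    None for the residual (rare-name) mass not yet tabulated."""
    x = random.random()
    k = bisect_right(cum, x)
    return names[k] if k < len(names) else None
```

A `None` result signals that the draw fell into the untabulated tail, at which point one could extend the table with more candidates, which is the "grow the data set slowly" step described above.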


As for the "solution" in the linked article, I'll confess I don't see what it has to do with the problem. Perhaps it's treating names as a finite multidimensional space, with each letter being one of the dimensions, but if so I missed that explanation.