Why is CDF the only way to randomly select from samples?

263 Views Asked by At

I am confused with the concept of cumulative distribution function (CDF).
I see it helps in algorithms related to sampling of data. So for instance if we have a list of values, and we randomly want to pick up some elements with probability dependent on their value, CDF helps us with that in an efficient manner.
What I am confused is why for this particular case an algorithm such as the following is not correct but we have to use CDF.
Algorithm:

  1. Sum all elements of the array/stream of elements $w_n$ we want to sample and get the total sum.
  2. Keep a mapping of individual elements with their positions in the array/stream. This is required for duplicate entries
  3. Pick a random number in the range of $[0, total Sum)$
  4. Scan linearly the array and check for each element if the random number selected is less than or equal to $\frac{w_i}{total Sum}*100$. If yes then
    4.a if there is only 1 occurence of the element $w_i$ in the list return $i$
    4.b if there are $k$ occurences of the element $w_i$ in the list select a random number in the range of $[0, k)$ and return the corresponding index in the original array $w$

It seems to me that this algorithm takes into account each element's "weight" in the array, it takes into account duplicates and is random.

But I have verified via testing that it is wrong.
What is the error in reasoning that makes it impossible to use this and CDF is the only correct approach?

Update:
To be clear the expectation is that for e.g. the input array $[5,15,20,30,30]$ then index $0$ is picked $5\%$ of the time, index $1$ is picked $15\%$ of the time,$2$ is picked $20\%$ of the time, $3$ is picked $30\%$ of the time and $4$ is picked $30\%$ of the time. CDF solves this, I am trying to understand why the algorithm I suggested does not if it essentially takes the percentages into account in the linear scan

1

There are 1 best solutions below

8
On

The cited link asks for a method to generate a random integer $I\in\{0,1,\dots,n-1\}$ according to the distribution specified by a list of positive integer weights $w_0,\dots,w_{n-1}$, such that $P(I=j)=w_j/\sum_{i=0}^{n-1}w_i$.

The so-called "CDF method" is as follows:

Return the least $i$ such that $CDF(i)\ge U$, where $U\sim\text{Uniform$(0,1)$}$ and $CDF(i) = \sum_{j=0}^iP(I=j)= {\sum_{j=0}^{i}w_j\over\sum_{j=0}^{n-1}w_j}.$

NB: $U\sim\text{Uniform$(0,1)$}$ means that $U$ is a random variable with a uniform distribution on the interval $(0,1)$. In a computer implementation, it would be realized by a pseudorandom real in that interval, e.g. U=random() in Python.)

This works, because $U$ must fall between two consecutive CDF values, and the rule just picks out the argument $j$ of the larger value: $$\begin{align}P(\text{return $j$})&=P\bigg(\big(\text{least $i$ such that $CDF(i)\ge U$}\big)=j\bigg)\\[2ex] &=P\bigg(CDF(j-1)\lt U\le CDF(j)\bigg)\\[2ex] &=CDF(j) - CDF(j-1)\quad\text{because $U$ is Uniformly distributed}\\[2ex] &={w_j\over\sum_{i=0}^{n-1}w_i}\quad \text{as required.} \end{align}$$

Now, the algorithm you suggest has the following rule (assuming for now that the $w_i$ are distinct):

Return the least $i$ such that $R\le 100{w_i\over\sum_{j=0}^{n-1}w_j}$ where $R\sim\text{Uniform$(0,\sum_{j=0}^{n-1}w_j)$}$

But $R/\sum_{j=0}^{n-1}w_j\sim\text{Uniform$(0,1)$}$, so this is the same as

Return the least $i$ such that $100{w_i\over(\sum_{j=0}^{n-1}w_j)^2}\ge U,$ where $U\sim\text{Uniform$(0,1)$}$.

Thus, contrasting/comparing this rule to that of the CDF method, we can see that your method (in the case of distinct $w_i$) has just replaced $CDF(i)={\sum_{j=0}^{i}w_j\over\sum_{j=0}^{n-1}w_j}$ by the ad hoc function $100{w_i\over(\sum_{j=0}^{n-1}w_j)^2}.$ Of course this means there is no possibility for this method to work as required. Other problems are also apparent; for example, your method is not scale invariant (as the CDF method is): multiplying all the weights by the same positive number must not change the result.


Your title suggests that the CDF method is the only way to sample a discrete distribution, but see the following for examples and links to many other ways: How to generate numbers based on an arbitrary discrete distribution?.