Why is CDF the only way to randomly select from samples?

Question

Why is CDF the only way to randomly select from samples?

263 Views Asked by Bumbble Comm At 27 Mar 2026 - 6:35

I am confused with the concept of cumulative distribution function (CDF).
I see it helps in algorithms related to sampling of data. So for instance if we have a list of values, and we randomly want to pick up some elements with probability dependent on their value, CDF helps us with that in an efficient manner.
What I am confused is why for this particular case an algorithm such as the following is not correct but we have to use CDF.
Algorithm:

Sum all elements of the array/stream of elements $w_n$ we want to sample and get the total sum.
Keep a mapping of individual elements with their positions in the array/stream. This is required for duplicate entries
Pick a random number in the range of $[0, total Sum)$
Scan linearly the array and check for each element if the random number selected is less than or equal to $\frac{w_i}{total Sum}*100$. If yes then
4.a if there is only 1 occurence of the element $w_i$ in the list return $i$
4.b if there are $k$ occurences of the element $w_i$ in the list select a random number in the range of $[0, k)$ and return the corresponding index in the original array $w$

It seems to me that this algorithm takes into account each element's "weight" in the array, it takes into account duplicates and is random.

But I have verified via testing that it is wrong.
What is the error in reasoning that makes it impossible to use this and CDF is the only correct approach?

Update:
To be clear the expectation is that for e.g. the input array $[5,15,20,30,30]$ then index $0$ is picked $5\%$ of the time, index $1$ is picked $15\%$ of the time,$2$ is picked $20\%$ of the time, $3$ is picked $30\%$ of the time and $4$ is picked $30\%$ of the time. CDF solves this, I am trying to understand why the algorithm I suggested does not if it essentially takes the percentages into account in the linear scan

Original Q&A

There are 1 best solutions below

**Brian Hill** · Answer 1 · 2021-05-18 02:20:14

The cited link asks for a method to generate a random integer $I\in\{0,1,\dots,n-1\}$ according to the distribution specified by a list of positive integer weights $w_0,\dots,w_{n-1}$, such that $P(I=j)=w_j/\sum_{i=0}^{n-1}w_i$.

The so-called "CDF method" is as follows:

Return the least $i$ such that $CDF(i)\ge U$, where $U\sim\text{Uniform$(0,1)$}$ and $CDF(i) = \sum_{j=0}^iP(I=j)= {\sum_{j=0}^{i}w_j\over\sum_{j=0}^{n-1}w_j}.$

NB: $U\sim\text{Uniform$(0,1)$}$ means that $U$ is a random variable with a uniform distribution on the interval $(0,1)$. In a computer implementation, it would be realized by a pseudorandom real in that interval, e.g. U=random() in Python.)

This works, because $U$ must fall between two consecutive CDF values, and the rule just picks out the argument $j$ of the larger value: $$\begin{align}P(\text{return $j$})&=P\bigg(\big(\text{least $i$ such that $CDF(i)\ge U$}\big)=j\bigg)\\[2ex] &=P\bigg(CDF(j-1)\lt U\le CDF(j)\bigg)\\[2ex] &=CDF(j) - CDF(j-1)\quad\text{because $U$ is Uniformly distributed}\\[2ex] &={w_j\over\sum_{i=0}^{n-1}w_i}\quad \text{as required.} \end{align}$$

Now, the algorithm you suggest has the following rule (assuming for now that the $w_i$ are distinct):

Return the least $i$ such that $R\le 100{w_i\over\sum_{j=0}^{n-1}w_j}$ where $R\sim\text{Uniform$(0,\sum_{j=0}^{n-1}w_j)$}$

But $R/\sum_{j=0}^{n-1}w_j\sim\text{Uniform$(0,1)$}$, so this is the same as

Return the least $i$ such that $100{w_i\over(\sum_{j=0}^{n-1}w_j)^2}\ge U,$ where $U\sim\text{Uniform$(0,1)$}$.

Thus, contrasting/comparing this rule to that of the CDF method, we can see that your method (in the case of distinct $w_i$) has just replaced $CDF(i)={\sum_{j=0}^{i}w_j\over\sum_{j=0}^{n-1}w_j}$ by the ad hoc function $100{w_i\over(\sum_{j=0}^{n-1}w_j)^2}.$ Of course this means there is no possibility for this method to work as required. Other problems are also apparent; for example, your method is not scale invariant (as the CDF method is): multiplying all the weights by the same positive number must not change the result.

Your title suggests that the CDF method is the only way to sample a discrete distribution, but see the following for examples and links to many other ways: How to generate numbers based on an arbitrary discrete distribution?.

Why is CDF the only way to randomly select from samples?

There are 1 best solutions below

Related Questions in PROBABILITY

Related Questions in PROBABILITY-THEORY

Related Questions in PROBABILITY-DISTRIBUTIONS

Related Questions in CUMULATIVE-DISTRIBUTION-FUNCTIONS

Trending Questions

Popular # Hahtags

Popular Questions