I was looking into a coding exercise which asks the following:
Given an array of positive integers $w$ where each $w_i$ describes the weight of the $i^{th}$ element, implement an algorithm that returns the index of the integer proportional to its weight in the array. Example: for input $[1, 9]$ when a number is picked out, the chance is that $9$ times out of $10$ we pick number $9$ as the answer. For $[1, 3]$ the probability of picking the index $0$ is $\frac{1}{1 + 3} = 25\%$ while the probability of picking the index $1$ is $\frac{3}{1 + 3} = 75\%$.
In a nutshell, the probability of picking an index depends on the value at that index in proportion to the sum of all values.
I implemented the following algorithm (this is not a programming question, so I'll keep it generic) which fails for $\frac{2}{3}$ of the test cases.
I am using coding syntax since I am not sure how I can accurately use formal notation for the purpose of my question.
class Item {
float probability;
int index;
}
// initialize
int sum = 0;
for(int i = 0; i < w.length; i++) {
sum += w[i];
}
for(int i = 0; i < w.length; i++) {
chances[i] = new Item(w[i]/(float) sum, i);
}
Arrays.sort(chances);
At this point I have sorted the probabilities for each item.
And the actual method with the following logic:
I keep a counter of the calls to the method. And then based on the number of times called I do binary search to find the item with the closest probability.
I.e. if the total sum is $10$ when the function is called a second time the $counter$ will have value $2$ and hence I am looking for the item that has $\frac{2}{10}$ chance or the one highest. If we reach the total sum I reset the counter value.
int counter = 0;
int index() {
++counter;
float chance = counter/(float)sum;
if(counter == sum) {
counter = 0;
}
int idx = Arrays.binarySearch(chances, new Item(chance, 0));
if(idx < 0) {
idx = -idx - 1;
}
if(idx >= chances.length) {
idx = chances.length - 1;
}
return chances[idx].index;
}
What exactly is the flaw in my reasoning here?
UPDATE:
After the comments and @Joe's answer I also tried the following:
int index() {
double rand = Math.random()*sum;
for(int i = 0; i < w.length; i++) {
if(Double.compare(rand, w[i]/(double)sum) <=0)
return i;
}
return w.length - 1;
}
Math.random() returns a number in the range $[0.0, 1.0)$, but this still fails.
Why is this wrong as well?
UPDATE 2:
I also tried the following algorithm:
// initialize
int sum = 0;
for(int i = 0; i < w.length; i++) {
sum += w[i];
set.computeIfAbsent(w[i], k -> new ArrayList<>()).add(i);
}
int index() {
    double probability = ThreadLocalRandom.current().nextInt(0, sum);
    for(int i = 0; i < w.length; i++) {
        if(probability <= (w[i]/(double)sum)*100) {
            if(set.get(w[i]).size() == 1) {
                return i;
            }
            else {
                int index = ThreadLocalRandom.current().nextInt(0, set.get(w[i]).size());
                return set.get(w[i]).get(index);
            }
        }
    }
    return w.length - 1;
}
UPDATE 3:
What this algorithm (in Java) does:
1. Declare a map whose keys are the values of the original array and whose values are lists of the indexes at which each value appears. Every unique value has a list of size 1, and every duplicated value has a list of all the indexes where it appears. E.g. for $[2, 3, 4, 4, 4, 1]$ the map has:
2 => [0],
3 => [1],
4 => [2, 3, 4],
1 => [5]
2. Loop over the original array $w$, sum the values, and populate the map from (1). At the end of the loop, sum is the total of the values in the array.
3. When the index method is called:
a) Pick an integer in the range $[0, sum)$.
b) Linearly scan the original array; divide the current value $w_i$ by sum and multiply by 100. If the random integer is less than or equal to the resulting value, then
- if the value appears in only one location in the array, return the current index $i$
- else pick a random integer in the range $[0, sizeOfListOfIndexes)$ and return the index stored at that position in the list
So basically I store the index of every element, so that duplicate elements have a list of the indexes where they occur (e.g. in the case of $[3, 4, 4, 4, 1]$, so that we pick one of the $4$s at random).
This still fails e.g. for $[3, 14, 1, 7]$ the code consistently returns a long list of $0$ and $1$ only.
What am I messing up here?
Desired specification for algorithm:
Given an input array of positive integers $x$, with length $n$, return an integer $i \in \{0, \dots, n-1 \}$ according to the probability distribution: $$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$
First, I should note that many libraries have built-in functions that do this, such as the numpy.random.choice function in Python's NumPy library. numpy.random.choice(numpy.arange(len(x)), p=x/numpy.sum(x)) will meet the specification. The keyword argument p is for passing in the desired probabilities.
To code an algorithm from 'scratch', given only some random number generator that can generate a random integer from some range, with uniform probability, one possible algorithm is:
Given an input array of positive integers $x$
Let $c$ be the array of cumulative sums.
Let $s=\operatorname{sum}(x)$.
Choose an integer $r$ with uniform probability from 0 to $s-1$, inclusive.
Let $i$ be the count of the entries in $c$ that are less than or equal to $r$.
Return $i$.
This would return an index $i$ from 0 to $\operatorname{length}(x)-1$ in such a way that the probability is given by: $$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$
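The steps above can be sketched in Java roughly as follows. This is only a minimal illustration, not a reference implementation: the class name WeightedIndexPicker and its method names are made up for this example, and it assumes a non-empty array of positive weights whose sum fits in an int.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: returns index i with probability x[i] / sum(x). */
final class WeightedIndexPicker {
    private final int[] cumulative; // c in the text: running sums of x
    private final int total;        // s in the text: sum of all weights

    WeightedIndexPicker(int[] x) {
        cumulative = new int[x.length];
        int running = 0;
        for (int i = 0; i < x.length; i++) {
            running += x[i];
            cumulative[i] = running;
        }
        total = running;
    }

    /** Maps a given r in [0, s-1] to an index by counting entries of c that are <= r. */
    int indexFor(int r) {
        int count = 0;
        for (int value : cumulative) {
            if (value <= r) {
                count++;
            }
        }
        return count;
    }

    /** Draws r uniformly from 0..s-1 and maps it to an index. */
    int pickIndex() {
        int r = ThreadLocalRandom.current().nextInt(total);
        return indexFor(r);
    }
}
```

Splitting the deterministic mapping (indexFor) from the random draw (pickIndex) is a design choice made here so the mapping can be checked without randomness.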
For example, if the input array is $x=[2,6,9,3,1,3]$, then $c=[2,8,17,20,21,24]$ and $s=24$.
The algorithm would choose an integer $r$ with uniform probability from 0 to 23, inclusive.
If $r$ is 0 or 1, then $i=0$. (probability 2/24)
If $r$ is 2,3,4,5,6 or 7, then $i=1$. (probability 6/24)
If $r$ is 8,9,10,11,12,13,14,15 or 16, then $i=2$. (probability 9/24)
If $r$ is 17,18 or 19, then $i=3$. (probability 3/24)
If $r$ is 20, then $i=4$. (probability 1/24)
If $r$ is 21,22 or 23, then $i=5$. (probability 3/24)
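The case analysis above can also be checked mechanically. The sketch below (the class and method names are invented for illustration) walks through every possible $r$, applies the counting rule, and tallies how often each index is selected; for the example array the tallies should equal the weights themselves, which is what makes the probabilities $x_i/\operatorname{sum}(x)$.

```java
/** Hypothetical check: enumerates every r in [0, s-1] instead of sampling. */
final class ExhaustiveCheck {
    /** Returns, for each index, how many values of r select it. */
    static int[] tally(int[] x) {
        int[] c = new int[x.length]; // cumulative sums
        int s = 0;                   // total weight
        for (int i = 0; i < x.length; i++) {
            s += x[i];
            c[i] = s;
        }
        int[] hits = new int[x.length];
        for (int r = 0; r < s; r++) {
            int i = 0;
            for (int value : c) {
                if (value <= r) {
                    i++; // count entries of c that are <= r
                }
            }
            hits[i]++;
        }
        return hits;
    }
}
```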
The intuition for the algorithm comes from the concept of inverse transform sampling in probability theory. Given a probability distribution in the form of a cumulative distribution function $F(x)$, you can generate random variables according to that distribution by generating a uniform random variable $u$ from 0 to 1, then choosing the smallest $x$ such that $F(x) \ge u$.
For this problem, in which the distribution is over a finite set of values, the probabilities are the values in the array $x$, divided by $\operatorname{sum}(x)$ so that the total probability is 1.
The cumulative distribution function is just the cumulative sum of those probabilities.
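To make the inverse transform idea concrete for the discrete case, here is a rough Java sketch that works with a uniform double $u$ instead of an integer. The names InverseTransform, cdf and indexFor are hypothetical, and the code assumes positive integer weights whose sum fits in an int. It builds the cumulative distribution and returns the smallest index whose cumulative probability is at least $u$.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical illustration of inverse transform sampling over array indexes. */
final class InverseTransform {
    /** Cumulative distribution: cdf[i] = (x[0] + ... + x[i]) / sum(x). */
    static double[] cdf(int[] x) {
        int total = 0;
        for (int value : x) {
            total += value;
        }
        double[] result = new double[x.length];
        int running = 0;
        for (int i = 0; i < x.length; i++) {
            running += x[i];
            result[i] = running / (double) total; // last entry is exactly 1.0
        }
        return result;
    }

    /** Smallest index i with cdf[i] >= u, for u in [0, 1). */
    static int indexFor(double[] cdf, double u) {
        for (int i = 0; i < cdf.length; i++) {
            if (cdf[i] >= u) {
                return i;
            }
        }
        return cdf.length - 1; // defensive fallback; not reached when u < 1
    }

    static int pick(int[] x) {
        double u = ThreadLocalRandom.current().nextDouble(); // uniform in [0, 1)
        return indexFor(cdf(x), u);
    }
}
```

Note that the comparison is against the cumulative probability, not against the individual probability $x_i/\operatorname{sum}(x)$.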
If we use our RNG to generate a random integer $r$ from 0 to $\operatorname{sum}(x)-1$, then it already chooses each integer in that range with probability $1/\operatorname{sum}(x)$, so we don't need to divide by $\operatorname{sum}(x)$. Because $r$ ranges over $0, \dots, \operatorname{sum}(x)-1$ rather than $1, \dots, \operatorname{sum}(x)$, the inequality becomes strict: we need the smallest index $i$ such that $F(x_i) > r/\operatorname{sum}(x)$, where $F(x_i) = c_i/\operatorname{sum}(x)$.
We can find that index by counting the number of entries in $c$ that are less than or equal to $r$. If $i$ entries in $c$ are less than or equal to $r$, then $F(x_i) > r/\operatorname{sum}(x)$ but $F(x_{i-1}) \le r/\operatorname{sum}(x)$ (or $i=0$).
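Since all weights are positive, $c$ is strictly increasing, so the count of entries in $c$ that are less than or equal to $r$ (the quantity used in the algorithm's steps) can be found with a binary search rather than a linear scan. The following is one possible sketch; the class and method names are made up for this example.

```java
/** Hypothetical helper: binary search over the cumulative sums. */
final class CumulativeSearch {
    /**
     * Returns the number of entries in the sorted array c that are <= r,
     * which is also the first position whose entry is greater than r.
     */
    static int countLessOrEqual(int[] c, int r) {
        int lo = 0;
        int hi = c.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c[mid] <= r) {
                lo = mid + 1; // c[mid] is counted; answer lies to the right
            } else {
                hi = mid;     // c[mid] > r; answer is mid or to the left
            }
        }
        return lo;
    }
}
```

This brings each lookup down to $O(\log n)$ after the $O(n)$ pass that builds $c$.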