Algorithm for sampling items fails the tests


I was looking into a coding exercise which asks the following:

Given an array of positive integers $w$ where each $w_i$ describes the weight of the $i^{th}$ element, implement an algorithm that returns an index with probability proportional to its weight. Example: for input $[1, 9]$, $9$ times out of $10$ on average we should pick index $1$ (the number $9$). For $[1, 3]$ the probability of picking the index $0$ is $\frac{1}{1 + 3} = 25\%$ while the probability of picking the index $1$ is $\frac{3}{1 + 3} = 75\%$.

In a nutshell, the probability of picking an index is its value divided by the sum of all the values.

I implemented the following algorithm (this is not a programming question, so I'll keep it generic) which fails for $\frac{2}{3}$ of the test cases.
I am using coding syntax since I am not sure how I can accurately use formal notation for the purpose of my question.

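class Item implements Comparable<Item> {
  float probability;
  int index;
  Item(float probability, int index) { this.probability = probability; this.index = index; }
  // Order items by probability so that Arrays.sort and Arrays.binarySearch work.
  public int compareTo(Item other) { return Float.compare(probability, other.probability); }
}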

// initialize  
int sum = 0;  
for(int i = 0; i < w.length; i++) {  
  sum += w[i];  
}

Item[] chances = new Item[w.length];
for (int i = 0; i < w.length; i++) {
  chances[i] = new Item(w[i] / (float) sum, i);
}
Arrays.sort(chances);

At this point I have sorted the probabilities for each item.

The actual method uses the following logic:
I keep a counter of the calls to the method. Based on the number of calls so far, I do a binary search to find the item with the closest probability.
E.g. if the total sum is $10$, then when the function is called a second time the counter has value $2$, so I look for the item that has a $\frac{2}{10}$ chance or, failing that, the next higher one. When the counter reaches the total sum, I reset it.

int counter = 0;
int index() {
  ++counter;
  float chance = counter / (float) sum;
  if (counter == sum) {
    counter = 0;
  }
  int idx = Arrays.binarySearch(chances, new Item(chance, 0));
  if (idx < 0) {
    idx = -idx - 1;
  }
  if (idx >= chances.length) {
    idx = chances.length - 1;
  }
  return chances[idx].index;
}

What exactly is the flaw in my reasoning here?

UPDATE:
After the comments and @Joe's answer I also tried the following:

int index() {
  double rand = Math.random() * sum;
  for (int i = 0; i < w.length; i++) {
    if (Double.compare(rand, w[i] / (double) sum) <= 0)
      return i;
  }
  return w.length - 1;
}

Math.random() returns a number in the range $[0.0, 1.0)$, but this still fails.
Why is this wrong as well?

UPDATE 2:
I also tried the following algorithm:

// initialize
Map<Integer, List<Integer>> set = new HashMap<>();
int sum = 0;
for (int i = 0; i < w.length; i++) {
  sum += w[i];
  set.computeIfAbsent(w[i], k -> new ArrayList<>()).add(i);
}

int index() {
  double probability = ThreadLocalRandom.current().nextInt(0, sum);
  for (int i = 0; i < w.length; i++) {
    if (probability <= (w[i] / (double) sum) * 100) {
      if (set.get(w[i]).size() == 1) {
        return i;
      } else {
        int index = ThreadLocalRandom.current().nextInt(0, set.get(w[i]).size());
        return set.get(w[i]).get(index);
      }
    }
  }
  return w.length - 1;
}

UPDATE 3:
What this algorithm (in Java) does:

  1. Declare a map whose keys are the values of the original array and whose values are lists of the indexes at which each value appears. Every unique value has a list of size 1, and every duplicated value has a list of all the indexes at which it appears. E.g. for $[2, 3, 4, 4, 4, 1]$ the map has:
    2 => [0],
    3 => [1],
    4 => [2, 3, 4],
    1 => [5]
  2. Loop over the original array $w$, summing the values and populating the map from (1). At the end of the loop, sum is the total of the values in the array.
  3. When the index method is called:
    a) Pick a random integer in the range $[0, sum)$.
    b) Linearly scan the original array; for each value $w_i$, divide it by the sum and multiply by 100. If the random integer is less than or equal to $\frac{w_i}{sum} \cdot 100$, then
    • if the value appears at only one location in the array, return the current index $i$
    • else pick a random integer in the range $[0, sizeOfListOfIndexes)$ and return the index stored at that position in the list

So basically I store the indexes for all elements, so that duplicate elements have a list of the indexes at which they occur (e.g. for $[3, 4, 4, 4, 1]$, so that we pick one of the $4$s at random).
This still fails: e.g. for $[3, 14, 1, 7]$ the code consistently returns a long list of $0$ and $1$ only.
What am I messing up here?

BEST ANSWER:

Desired specification for algorithm:

Given an input array of positive integers $x$, with length $n$, return an integer $i \in \{0, \dots, n-1 \}$ according to the probability distribution: $$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$

First, I should note that many libraries have built-in functions that do this, such as the numpy.random.choice function in Python's NumPy library.

numpy.random.choice(numpy.arange(len(x)), p=x/numpy.sum(x)) will meet the specification. The keyword argument p is for passing in the desired probabilities.

To code an algorithm from 'scratch', given only some random number generator that can generate a random integer from some range, with uniform probability, one possible algorithm is:

Given an input array of positive integers $x$

Let $c$ be the array of cumulative sums of $x$, so that $c_i = x_0 + \dots + x_i$.

Let $s=\operatorname{sum}(x)$.

Choose an integer $r$ with uniform probability from 0 to $s-1$, inclusive.

Let $i$ be the count of the entries in $c$ that are less than or equal to $r$.

Return $i$.

This would return an index $i$ from 0 to $\operatorname{length}(x)-1$ in such a way that the probability is given by: $$P(i)=\frac{x_i}{\operatorname{sum}(x)}$$
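The steps above can be sketched in Java roughly as follows. This is only a minimal illustration, assuming a non-empty array of positive weights whose sum fits in an int; the class name CumulativeSampler and the method names indexFor and index are hypothetical, chosen for this example. The counting is done with a plain linear scan, exactly as the steps describe.

```java
import java.util.Random;

// Hypothetical helper class; names are illustrative, not from any library.
class CumulativeSampler {
    private final int[] c;     // c[i] = x[0] + ... + x[i]
    private final int s;       // s = sum(x)
    private final Random rng;

    // Assumes x is non-empty and contains only positive integers.
    CumulativeSampler(int[] x, Random rng) {
        this.c = new int[x.length];
        int running = 0;
        for (int i = 0; i < x.length; i++) {
            running += x[i];
            c[i] = running;
        }
        this.s = running;
        this.rng = rng;
    }

    // Count the entries of c that are <= r; that count is the chosen index.
    int indexFor(int r) {
        int count = 0;
        for (int value : c) {
            if (value <= r) {
                count++;
            }
        }
        return count;
    }

    // Draw r uniformly from 0..s-1 (inclusive) and map it to an index.
    int index() {
        return indexFor(rng.nextInt(s));
    }
}
```

Separating indexFor from index keeps the mapping from $r$ to $i$ deterministic, which makes it easy to check by hand before any randomness is involved.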

For example, if the input array is $x=[2,6,9,3,1,3]$, then $c=[2,8,17,20,21,24]$ and $s=24$.

The algorithm would choose an integer $r$ with uniform probability from 0 to 23, inclusive.

If $r$ is 0 or 1, then $i=0$. (probability 2/24)

If $r$ is 2,3,4,5,6 or 7, then $i=1$. (probability 6/24)

If $r$ is 8,9,10,11,12,13,14,15 or 16, then $i=2$. (probability 9/24)

If $r$ is 17,18 or 19, then $i=3$. (probability 3/24)

If $r$ is 20, then $i=4$. (probability 1/24)

If $r$ is 21,22 or 23, then $i=5$. (probability 3/24)
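The table above can be checked mechanically by enumerating every possible $r$ instead of drawing it at random. The following Java sketch does that; the class name WorkedExampleCheck and the method tally are hypothetical names used only for this illustration. If the algorithm is correct, the number of values of $r$ that map to each index should equal the corresponding weight.

```java
import java.util.Arrays;

// Hypothetical check; names are illustrative only.
class WorkedExampleCheck {
    // For every r in 0..s-1, count the entries of c that are <= r
    // and record how many values of r land on each index.
    static int[] tally(int[] x) {
        int[] c = new int[x.length];
        int running = 0;
        for (int i = 0; i < x.length; i++) {
            running += x[i];
            c[i] = running;
        }
        int[] hits = new int[x.length];
        for (int r = 0; r < running; r++) {
            int i = 0;
            for (int value : c) {
                if (value <= r) {
                    i++;
                }
            }
            hits[i]++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [2, 6, 9, 3, 1, 3]
        System.out.println(Arrays.toString(tally(new int[] {2, 6, 9, 3, 1, 3})));
    }
}
```

Since each $r$ is equally likely, a tally of $x_i$ out of 24 for index $i$ is exactly the probability $x_i/24$ listed above.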

The intuition for the algorithm comes from the concept of inverse transform sampling in probability theory. Given a probability distribution in the form of a cumulative distribution function $F(x)$, you can generate random variables according to that distribution by generating a uniform random variable $u$ from 0 to 1, then choosing the smallest $x$ such that $F(x) \ge u$.
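For a discrete distribution, inverse transform sampling in its textbook form might look like the Java sketch below. It works with a real-valued $u$ and a normalized cumulative distribution rather than with integers, so it is a variant for illustration and not the integer algorithm given earlier. The names InverseTransformSketch, cdfOf, smallestIndexAtLeast and sample are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of discrete inverse transform sampling; names are illustrative.
class InverseTransformSketch {
    // Build the cumulative distribution: cdf[i] = (w[0] + ... + w[i]) / sum(w).
    static double[] cdfOf(int[] weights) {
        double total = 0;
        for (int w : weights) {
            total += w;
        }
        double[] cdf = new double[weights.length];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            cdf[i] = running / total;
        }
        return cdf;
    }

    // Return the smallest index i with cdf[i] >= u.
    static int smallestIndexAtLeast(double[] cdf, double u) {
        for (int i = 0; i < cdf.length; i++) {
            if (cdf[i] >= u) {
                return i;
            }
        }
        // Fallback in case rounding leaves the last entry slightly below u.
        return cdf.length - 1;
    }

    // Draw u uniformly from [0, 1) and invert the cumulative distribution.
    static int sample(int[] weights, Random rng) {
        return smallestIndexAtLeast(cdfOf(weights), rng.nextDouble());
    }
}
```

For weights $[1, 3]$ the cumulative distribution is $[0.25, 1.0]$, so any $u \le 0.25$ selects index 0 and any larger $u$ selects index 1.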

For this problem, in which the distribution is over a finite set of values, the probabilities are the values in the array $x$, divided by $\operatorname{sum}(x)$ so that the total probability is 1.

The cumulative distribution function is just the cumulative sum of those probabilities.

If we use our RNG to generate a random integer $r$ from 0 to $\operatorname{sum}(x)-1$, then it already chooses each integer in that range with probability $1/\operatorname{sum}(x)$, so we don't need to divide by $\operatorname{sum}(x)$. Writing $F(i)=c_i/\operatorname{sum}(x)$ for the cumulative distribution at index $i$, we just need to find the smallest index $i$ such that $F(i) > r/\operatorname{sum}(x)$. The inequality is strict because $r$ starts at 0 rather than 1; equivalently, $F(i)\ge (r+1)/\operatorname{sum}(x)$.

We can find that index by counting the number of entries in $c$ that are less than or equal to $r$. If exactly $i$ entries in $c$ are less than or equal to $r$, then $F(i) > r/\operatorname{sum}(x)$ but $F(i-1)\le r/\operatorname{sum}(x)$ (or $i=0$).
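Because the weights are positive, $c$ is strictly increasing, so the count of entries in $c$ that are less than or equal to $r$ does not require a full scan. One possible refinement, sketched below in Java, is a binary search for the first entry greater than $r$. This is an optional optimization rather than part of the algorithm as stated, and the names BinarySearchPicker and countAtMost are hypothetical.

```java
// Hypothetical sketch; names are illustrative only.
class BinarySearchPicker {
    // Number of entries in the strictly increasing array c that are <= r.
    // This equals the smallest index i with c[i] > r (or c.length if none).
    static int countAtMost(int[] c, int r) {
        int lo = 0;
        int hi = c.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c[mid] <= r) {
                lo = mid + 1;   // c[mid] is counted, so the answer lies to the right
            } else {
                hi = mid;       // c[mid] > r, so the answer is mid or to the left
            }
        }
        return lo;
    }
}
```

With $c=[2,8,17,20,21,24]$, countAtMost(c, 7) gives 1 and countAtMost(c, 8) gives 2, matching the linear count, while each lookup takes $O(\log n)$ comparisons instead of $O(n)$.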