Selecting k distinct numbers from an array with increasing probability distribution

1k Views Asked by At

I have to select k distinct numbers from an array such that probability of a number getting selected is more if it is at the end of the array (probability increases linearly). I'm thinking of assigning probability to each number as

current_position_of_number_in_array/(n*(n+1)/2)

where n is the size of the given array. Then generate a random number between 0 and 1 and selecting the number in array which falls into that range. This will work fine if I have to select non-distinct numbers, but in case I have to select distinct numbers, this won't be efficient if size of k is of the order of n.(assuming I'm using some kind of hash-set to discard repeated numbers)

Are there better algorithms to select k distinct numbers which are of better big-oh complexity ?

2

There are 2 best solutions below

0
On BEST ANSWER

I finally ended up using weighted reservoir sampling with uniformly increasing weights.

Here is a detailed paper: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

Implementation in Java: http://utopia.duth.gr/~pefraimi/projects/WRS/

3
On

As I pointed out in a comment, you can choose a single number with linearly increasing probability by choosing two distinct numbers and using the larger one. (This gives probability $0$ for the least element, so you need to add a dummy least element to get exactly the probabilities you describe.)

To use this for efficiently selecting $k$ distinct elements with linear probabilities, set up an array originally containing all $n$ numbers. Then in each step, select $i$ uniformly randomly between $1$ and $n$ and $j$ uniformly randomly from the numbers remaining in the array. Repeat this until $j\gt i$. Use $j$ as the next number and remove it from the array (in constant time, by overwriting it with the last element of the array and decrementing the array length).

This should select all $n$ numbers in $O(n\log n)$ time. I don't have a rigorous proof for that off the top of my head, but it seems plausible from a coupon collector's perspective. The direct approach you described would take $O(n^2)$ time to select the $1$ if it's the last number left, and I think it would take $O(n^2\log n)$ overall. By contrast, this approach would take $O(n)$ time to select the $1$ if it's the last number left, and thus should take $O(n\log n)$ time overall.