What is the importance of the inclusive range in Reservoir with Random Sort?


I am reading the "Reservoir with Random Sort" section on Wikipedia, and the algorithm reads (copied verbatim):

/*
  S is a stream of items to sample, R will contain the result
  S.Current returns current item in stream
  S.Next advances stream to next position
  max-priority-queue supports:
    Count -> number of items in priority queue
    Maximum() -> returns maximum key value of all items
    Extract-Max() -> Remove the item with max key
    Insert(key, Item) -> Adds item with specified key
*/
ReservoirSample(S[1..?], R[1..k])
  H = new max-priority-queue
  while S has data
    r = Random(0,1)  // important: inclusive range
    if H.Count < k
      H.Insert(r, S.Current)
    else
      if H.Maximum > r
        H.Extract-Max()
        H.Insert(r, S.Current)
    S.Next

Notice the comment: important: inclusive range.
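For reference, here is a minimal Python sketch of the quoted pseudocode. This is my own illustration, not the Wikipedia code: Python's `heapq` is a min-heap, so keys are negated to emulate the max-priority-queue, and note that `random.random()` returns values in [0.0, 1.0), i.e. it excludes 1.0 — precisely the discrepancy asked about below.

```python
import heapq
import random

def reservoir_sample(stream, k):
    """Keep the k items whose random keys are smallest.

    Emulates the max-priority-queue of the pseudocode with a
    min-heap of (-key, item) pairs: the heap root then holds
    the current maximum key.
    """
    heap = []  # entries are (-key, item)
    for item in stream:
        r = random.random()  # [0.0, 1.0) -- 1.0 is NOT included
        if len(heap) < k:
            heapq.heappush(heap, (-r, item))
        elif -heap[0][0] > r:  # H.Maximum > r
            heapq.heapreplace(heap, (-r, item))  # Extract-Max + Insert
    return [item for _, item in heap]
```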

I want to implement a reservoir using Amazon's Redshift. The straightforward implementation is to generate a random number for each element, sort by that number and take the top k items.
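As a sanity check on that approach, here is a small Python sketch of the same idea (my own illustration, roughly mirroring the SQL idiom `SELECT ... ORDER BY random() LIMIT k`): assign each element a random key, sort by the key, and take the first k.

```python
import random

def sample_by_sort(pool, k):
    """Assign each element a random key in [0.0, 1.0),
    sort by that key, and return the first k elements."""
    keyed = [(random.random(), x) for x in pool]
    keyed.sort()
    return [x for _, x in keyed[:k]]
```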

The documentation for the Redshift random() function does not explicitly state whether 1.0 is included. The documentation for the PostgreSQL 9.4 random() function states that it returns a random value in the range 0.0 <= x < 1.0. Redshift is a fork of PostgreSQL, so it is reasonable to assume the same rule applies.

My question, then: how important is it that 1.0 be included? If 1.0 is excluded, will that change the sampling behaviour when I sample anywhere from 1k to 5M elements from a pool of 50M+ elements, especially given that I take the top k items?