I am reading the Reservoir with Random Sort page on Wikipedia, and the algorithm says (copied):
/*
  S is a stream of items to sample, R will contain the result
  S.Current returns current item in stream
  S.Next advances stream to next position
  max-priority-queue supports:
    Count       -> number of items in priority queue
    Maximum()   -> returns maximum key value of all items
    Extract-Max() -> Remove the item with max key
    Insert(key, Item) -> Adds item with specified key
*/
ReservoirSample(S[1..?], R[1..k])
  H = new max-priority-queue
  while S has data
    r = Random(0,1) // important: inclusive range
    if H.Count < k
      H.Insert(r, S.Current)
    else
      if H.Maximum > r
        H.Extract-Max()
        H.Insert(r, S.Current)
    S.Next
Notice the comment: important: inclusive range.
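For reference, here is a minimal Python sketch of the pseudocode above, using heapq with negated keys to emulate the max-priority-queue (note that Python's own random.random() returns values in [0.0, 1.0), so it does not honour the "inclusive" comment either):

```python
import heapq
import random

def reservoir_sample(stream, k):
    """Keep the k items with the smallest random keys, evicting the
    current maximum key whenever a smaller one arrives -- this mirrors
    the max-priority-queue pseudocode above."""
    heap = []  # max-heap emulated by negating keys: entries are (-r, item)
    for item in stream:
        r = random.random()  # [0.0, 1.0) -- the endpoint 1.0 is excluded
        if len(heap) < k:
            heapq.heappush(heap, (-r, item))
        elif -heap[0][0] > r:  # H.Maximum > r
            heapq.heapreplace(heap, (-r, item))
    return [item for _, item in heap]
```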
I want to implement a reservoir sample on Amazon Redshift. The straightforward implementation is to generate a random number for each element, sort by that number, and take the top k items.
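In Python terms, this straightforward implementation (the analogue of sorting by a random() column and taking the first k rows) would look roughly like this sketch:

```python
import random

def sample_by_random_sort(items, k):
    # Attach an independent uniform key to every element, sort by the
    # key, and keep the first k -- the same distribution as the
    # priority-queue version, just materialised all at once.
    keyed = sorted((random.random(), item) for item in items)
    return [item for _, item in keyed[:k]]
```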
The documentation for the Redshift random() function does not explicitly state whether 1.0 is included or not. The documentation for the PostgreSQL 9.4 random() function states that it returns a random value in the range 0.0 <= x < 1.0. Redshift is a fork of PostgreSQL, so it seems reasonable to believe the same rule applies.
My question, then: how important is it that 1.0 be included? If 1.0 is excluded, will it change the sampling behaviour when I sample anywhere from 1k to 5M elements out of a pool of 50M+ elements, especially since I take only the top k items?
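My own back-of-envelope check, assuming random() picks one of 2**53 equally likely double values (a common implementation choice, not something the Redshift docs confirm), suggests the endpoint is essentially unobservable at this scale:

```python
# The chance that any particular single value -- such as the endpoint
# 1.0, were it included -- shows up even once across n independent
# draws is at most n / 2**53.
n = 50_000_000  # size of the pool being sampled
p_endpoint_ever_drawn = n / 2**53
print(p_endpoint_ever_drawn)  # about 5.6e-9
```

But I am not sure whether this reasoning is sound, or whether the boundary matters in some subtler way for the top-k selection.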