Composition (Combinatorics) with upper bounds

Problem

There are $n$ balls numbered from 1 to $n$.
We uniformly sample $k$ balls without replacement.
Let $x$ be the number of sampled balls that fall in the range $1..m$.
I would like to sample $x$ using minimal time and memory.

1 2 3 ..  m               n
o o o o o o | o o o o o o o 
  ^     ^         ^

Uniformly sampled k=3 balls. x=2.

By 'efficiently' I mean that I don't want to iterate over all $n$ balls. Also, generating $k$ indices and checking them for duplicates requires $O(k)$ memory, which is too much for a single machine.

Motivation

I have a large dataset of size $n$ (10,000,000,000 rows).
The dataset is split into multiple files and stored in a distributed manner.
I want to sample $k$ rows ($100$ to $100{,}000{,}000$) in parallel; the sampled rows don't fit in the memory of a single machine.

If there are 5 files of sizes

sizes = [10, 5, 3, 8, 4]

I want to first sample the number of rows to take from each file, e.g.

num_samples = [6, 2, 1, 2, 3]

where $\text{num\_samples}[i] \le \text{sizes}[i]$ and $\sum_i \text{num\_samples}[i] = k$.

Then I will send these numbers to the machines and let each one sample from its own file.
The sampled rows don't have to be returned to the original machine.
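
To make the target interface concrete, here is a rough sketch of the driver I have in mind. split_sample_counts is the (hypothetical) function this question asks how to implement, and send_sample_request stands in for whatever RPC mechanism the cluster uses:

    # Sketch of the intended workflow, not a solution.
    # split_sample_counts is the function this question asks for;
    # send_sample_request is a stand-in for the cluster's RPC layer.
    def distribute_sampling(sizes, k, split_sample_counts, send_sample_request):
        num_samples = split_sample_counts(sizes, k)    # e.g. [6, 2, 1, 2, 3]
        assert all(0 <= s <= n for s, n in zip(num_samples, sizes))
        assert sum(num_samples) == k
        for file_id, count in enumerate(num_samples):
            send_sample_request(file_id, count)        # each machine samples its own file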

Similar problems

Composition - here we additionally have an upper bound on each part.

Simple random sampling without replacement from a huge dataset - that question assumes a single machine.

If there is no mathematically exact and efficient method, rejection sampling with a reasonable expected number of rejections would be OK too.


Best Answer

High-level Summary

I suspect there isn't a way to generate theoretically perfect samples efficiently, but I also think approximations will work well at your problem sizes. Options 1 and 2 below only approximate at the final step, yet they are far more work than the very simple option 3, and I don't see much reason to expect they give better results. Overall, my recommendation is to just implement option 3 and move on with your life, since it gives a "pretty good" approximation with minimal coding time/complexity. If you care a lot about accuracy, you could implement both options 1 and 3 and compare the results on your dataset.

Theoretical setup

In your example with sizes = [10, 5, 3, 8, 4], let's assign $k_b$ to be the number of items chosen from file $b$, with $1 \le b \le 5$. Let the total number of samples be $K = \sum_b k_b$. Similarly, let $n_b$ be the total number of available items in bucket $b$, and denote the total size of the dataset by $N = \sum_b n_b$.

Let's think about the distribution of $k_1$. If we can randomly select $k_1$ using the correct distribution then we can solve the full problem, since you could then apply the same method to choose $k_2$ knowing you need to select $K-k_1$ items from the remaining buckets, and so on.

There are $10$ items in bucket $1$ and $30$ items total, so $$P[k_1 = j] = \frac{\binom{10}{j} \binom{20}{K-j}}{\binom{30}{K}},$$ or in general $$P[k_1 = j] = \frac{\binom{n_1}{j} \binom{N-n_1}{K-j}}{\binom{N}{K}}.$$

Those probabilities define the PDF for $k_1$. In theory we can use the probabilities above to perform a weighted random selection of $k_1$ and we're done. The distribution of $k_1$ is called hypergeometric.
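
If an off-the-shelf hypergeometric sampler is available, this sequential scheme can be coded directly. Here is a minimal sketch assuming numpy's Generator.hypergeometric; note that numpy caps the population size it will accept (check its docs), so at the $10^{10}$ scale in the question you may need the approximations discussed below instead:

    # Exact sequential split: each per-bucket count is a hypergeometric draw
    # over the buckets not yet processed (multivariate hypergeometric overall).
    # Assumes numpy's sampler; very large populations may exceed its limits.
    import numpy as np

    def split_counts_exact(sizes, k, seed=None):
        rng = np.random.default_rng(seed)
        remaining_other = sum(sizes)   # balls in buckets not yet processed
        remaining_k = k
        counts = []
        for n_b in sizes:
            remaining_other -= n_b
            if remaining_k == 0:
                k_b = 0
            elif remaining_other == 0:
                k_b = remaining_k      # last bucket takes whatever is still needed
            else:
                # k_b ~ Hypergeometric(ngood=n_b, nbad=remaining_other, nsample=remaining_k)
                k_b = int(rng.hypergeometric(n_b, remaining_other, remaining_k))
            counts.append(k_b)
            remaining_k -= k_b
        return counts

    # Example with the sizes from the question:
    # split_counts_exact([10, 5, 3, 8, 4], k=14)  ->  e.g. [5, 2, 1, 4, 2]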

Practical concerns

However, in practice there will be some extremely big numbers involved in those binomial coefficients, and the final probabilities will be extremely tiny, to the point where floating point arithmetic is going to get you in trouble. I can think of a couple of different ways to get a more tractable result. Either way, the goal is to build an approximate CDF for $k_1$ as an array of size $n_1 + 1$. Then you could sample from this distribution by choosing a random real number in $[0, 1]$ and binary searching in the CDF array.

Option 1: Approximate the PDF directly. You could use Stirling's approximation to estimate the binomial coefficients in the PDF. There'll still be lots of big exponents around, so you'll want to operate by computing $\log(P)$ instead of computing the probability $P$ directly. I get the following expression:

[image: the resulting expression for $\log P[k_1 = j]$ after applying Stirling's approximation]
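
The exact log-probability is just a sum of three log-binomial terms, and keeping only the leading terms of Stirling's formula ($\log a! \approx a \log a - a$) for each factorial gives, roughly,
$$\log P[k_1 = j] = \log\binom{n_1}{j} + \log\binom{N-n_1}{K-j} - \log\binom{N}{K}, \qquad \log\binom{a}{b} \approx a\log a - b\log b - (a-b)\log(a-b).$$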

Working in $\log$ space this way avoids the huge exponents. Also, once you build your full PDF array, the approximation means it won't sum to exactly 1. (For $N=10^{11}$, $n_1=10^6$, $K=10^8$ I get a sum of around $0.9998$.) You should divide the whole array by its sum so that the probabilities sum to 1.

Once you have the PDF array, you can just compute the CDF array using prefix sums and then sample $k_1$ using the CDF array.
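
A minimal sketch of option 1 in Python. Instead of an explicit Stirling formula it uses math.lgamma for the log-binomials (same idea: stay in log space), then normalizes, builds the CDF with prefix sums, and samples by binary search:

    # Option 1 sketch: log-space PMF for k_1, normalized, then CDF + binary search.
    # math.lgamma stands in for an explicit Stirling formula for the log-binomials.
    import bisect
    import math
    import random

    def log_binom(a, b):
        # log C(a, b) via log-gamma; stable for very large a, b
        return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

    def sample_k1(n1, N, K, rng=random):
        lo = max(0, K - (N - n1))              # smallest feasible k_1
        hi = min(n1, K)                        # largest feasible k_1
        log_p = [log_binom(n1, j) + log_binom(N - n1, K - j) - log_binom(N, K)
                 for j in range(lo, hi + 1)]
        shift = max(log_p)
        p = [math.exp(lp - shift) for lp in log_p]   # shift before exp to avoid underflow
        total = sum(p)
        cdf = []
        acc = 0.0
        for v in p:
            acc += v / total                   # renormalize so the CDF ends at 1
            cdf.append(acc)
        u = rng.random()
        return lo + min(bisect.bisect_left(cdf, u), len(cdf) - 1)

The array has at most $\min(n_1, K) + 1$ entries, which is fine for moderate $n_1$ or $K$ but gets expensive at the upper end of your range.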

Option 2: Approximate the CDF directly.

We can ask Mathematica to compute the CDF:

[image: Mathematica's closed form for the CDF, involving HypergeometricPFQ]

Next you'll want to approximate the CDF. For that you'll need to approximate not only the binomial coefficients (Stirling's approximation can still be used) but also the HypergeometricPFQ term. You can look up approximation methods, but overall this route is likely more trouble than it's worth.

Option 3 (recommended): More drastic approximation steps

Look at this section from the Wikipedia article on hypergeometric distribution. Basically:

  • You can approximate a hypergeometric distribution using a binomial distribution.
    • This essentially amounts to saying the distribution of $k_1$ wouldn't change much if you sampled with replacement instead. This is basically valid if you're sampling only a small fraction of the data. This random SE answer suggests using the approximation if the sample fraction is under $10\%$; people could disagree about the cutoff, but you should be fine since the ratio in your example is $\frac{1}{1000}$.
    • There is some literature on fast algorithms for sampling from huge binomial distributions. However I'd probably suggest using the second formula at that Wikipedia link, namely:
  • You can approximate a binomial distribution using a normal distribution.
    • This is basically valid if the numbers are big and $p = \frac{K}{N}$ is not too close to $0$ or $1$. People state lots of different heuristics for when $p$ is too extreme; for example, here's a random site that suggests the criteria $\frac{n_1 K}{N} > 5$ and $\frac{n_1 (N-K)}{N} > 5$. Overall, for large-ish $K$ this approximation should be fine. (If $K$ is small enough to fit comfortably on one machine then you could switch to more straightforward methods; here is one example.) A sketch combining the two approximation steps follows this list.
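
Putting option 3 together, here is a minimal sketch (my own, under the approximations above): it splits $K$ across the buckets sequentially, replacing each hypergeometric draw by a normal with matching mean and variance, then rounding and clipping to the feasible range so the counts always stay consistent.

    # Option 3 sketch: sequential split of K samples across buckets, with each
    # hypergeometric draw replaced by a normal approximation (matched mean/variance).
    import math
    import random

    def split_counts_approx(sizes, k, rng=random):
        remaining_other = sum(sizes)       # balls in buckets not yet processed
        remaining_k = k
        counts = []
        for n_b in sizes:
            remaining_other -= n_b
            n_rem = n_b + remaining_other  # population for this step
            lo = max(0, remaining_k - remaining_other)   # feasible range of k_b
            hi = min(n_b, remaining_k)
            if lo == hi:
                k_b = lo
            else:
                p = n_b / n_rem
                mean = remaining_k * p
                # hypergeometric variance, including the finite-population correction
                var = remaining_k * p * (1 - p) * (n_rem - remaining_k) / (n_rem - 1)
                k_b = round(rng.gauss(mean, math.sqrt(var)))
                k_b = min(max(k_b, lo), hi)              # clip back into the support
            counts.append(k_b)
            remaining_k -= k_b
        return counts

    # Example: split_counts_approx([10, 5, 3, 8, 4], k=14)

Because each draw is clipped to its feasible range, the result always satisfies $0 \le \text{counts}[i] \le \text{sizes}[i]$ and $\sum_i \text{counts}[i] = k$.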