Problem
There are $n$ balls numbered from 1 to $n$.
We uniformly sample $k$ balls without replacement.
Let $x$ be the number of sampled balls in a range 1..$m$.
I would like to sample $x$ efficiently, with minimal time and memory.
1 2 3 .. m n
o o o o o o | o o o o o o o
^ ^ ^
Uniformly sampled k=3 balls. x=2.
By 'efficiently', I mean I don't want to iterate over all $n$ balls. Also, generating $k$ indices and checking for duplicates requires $O(k)$ memory, which is too large for a single machine.
Motivation
I have a large dataset of size $n$ (10,000,000,000 rows).
The dataset is split into multiple files and stored in a distributed manner.
I want to sample $k$ rows ($k$ between 100 and 100,000,000) in parallel. The sampled rows don't fit in the memory of a single machine.
If there are 5 files of sizes
sizes = [10, 5, 3, 8, 4]
I first want to sample the number of rows to draw from each file, e.g.
num_samples = [6, 2, 1, 2, 3]
where $\text{num\_samples}[i] \le \text{sizes}[i]$ and $\sum_i \text{num\_samples}[i] = k$.
Then I will send these numbers to each machine and let them sample from each file.
Sampled rows don't have to be returned to the original machine.
Similar problems
Composition - we have an upper bound on each number.
Simple random sampling without replacement of huge dataset - This is a single machine.
If there is no mathematically efficient method, a rejection sampling with reasonable expected number of rejections would be ok too.
High-level Summary
I suspect there isn't a way to generate theoretically perfect samples, but I also think approximations will work well given the huge sizes in your case. Approximating only at the end (options 1 and 2 below) is far more work than the very simple option 3, and I don't see much reason to expect options 1 or 2 to give better results. Overall, my recommendation is to just implement option 3 and move on with your life, since it gives a pretty good approximation with minimal coding time and complexity. If you care a lot about accuracy, you could implement both options 1 and 3 and compare the results on your dataset.
Theoretical setup
In your example with sizes = [10, 5, 3, 8, 4], let $k_b$ be the number of items chosen from file $b$, with $1 \le b \le 5$, and let the total number of samples be $K = \sum_b k_b$. Similarly, let $n_b$ be the number of available items in bucket $b$, and denote the total size of the dataset by $N = \sum_b n_b$.

Let's think about the distribution of $k_1$. If we can randomly select $k_1$ from the correct distribution, then we can solve the full problem: apply the same method to choose $k_2$, knowing you now need to select $K - k_1$ items from the remaining buckets, and so on.
There are $10$ items in bucket $1$ and $30$ items total, so $$P[k_1 = j] = \frac{\binom{10}{j} \binom{20}{K-j}}{\binom{30}{K}}$$
Those probabilities define the PDF for $k_1$. In theory we can use the probabilities above to perform a weighted random selection of $k_1$ and we're done. The distribution of $k_1$ is called hypergeometric.
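For moderate sizes, this sequential splitting can be done exactly with a library hypergeometric sampler. A sketch using NumPy (note that `numpy.random.Generator.hypergeometric` limits the population size to roughly $10^9$, so it won't cover the full $10^{10}$-row case directly; function and variable names here are my own):

```python
import numpy as np

def split_sample_counts(sizes, k, rng=None):
    """Split k total draws among buckets of the given sizes by
    sequentially drawing each count from a hypergeometric distribution."""
    rng = rng or np.random.default_rng()
    remaining_total = sum(sizes)  # items not yet decided on
    remaining_k = k               # draws not yet assigned
    counts = []
    for size in sizes:
        remaining_total -= size
        # Population: `size` "good" items (this bucket) + `remaining_total` "bad"
        # items (later buckets); draw `remaining_k` and count the good ones.
        c = rng.hypergeometric(size, remaining_total, remaining_k) if remaining_k > 0 else 0
        counts.append(int(c))
        remaining_k -= c
    return counts

counts = split_sample_counts([10, 5, 3, 8, 4], 6)
```

Each bucket's count is conditioned on what earlier buckets already took, which is exactly the recursion described above.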
Practical concerns
However, in practice there will be some extremely big numbers involved in those binomial coefficients, and the final probabilities will be extremely tiny, to the point where floating point arithmetic is going to get you in trouble. I can think of a couple of different ways to get a more tractable result. Either way, the goal is to build an approximate CDF for $k_1$ as an array of size $n_1 + 1$. Then you could sample from this distribution by choosing a random real number in $[0, 1]$ and binary searching in the CDF array.
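The binary-search step might look like this (a sketch; `cdf` is assumed to be a nondecreasing array whose last entry is 1):

```python
import numpy as np

def sample_from_cdf(cdf, rng=None):
    """Inverse-CDF sampling: draw u ~ U[0, 1) and binary-search for the
    first index j with cdf[j] > u."""
    rng = rng or np.random.default_rng()
    u = rng.random()
    return int(np.searchsorted(cdf, u, side="right"))

# Toy example: P[X=0] = 0.2, P[X=1] = 0.5, P[X=2] = 0.3.
cdf = np.cumsum([0.2, 0.5, 0.3])
x = sample_from_cdf(cdf)
```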
Option 1: Approximate the PDF directly. You could use Stirling's approximation to estimate the binomial coefficients in the PDF. There'll still be lots of big exponents around, so you'll want to operate by computing $\log(P)$ instead of computing the probability $P$ directly. I get the following expression:
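As a sketch of what Stirling gives (using only the leading term $\ln n! \approx n \ln n - n$, whose linear parts cancel inside each binomial coefficient; the exact expression you derive may carry extra $\tfrac{1}{2}\ln(2\pi n)$ correction terms):

$$\log P[k_1 = j] \approx f(n_1, j) + f(N - n_1, K - j) - f(N, K), \qquad f(a, b) = a \ln a - b \ln b - (a - b)\ln(a - b).$$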
Once you build your full PDF array, the approximation means it won't sum to exactly 1. (For $N = 10^{11}$, $n_1 = 10^6$, $K = 10^8$ I get a sum of around $0.9998$.) Divide the whole array by its sum so that the probabilities sum to 1.
Once you have the PDF array, you can just compute the CDF array using prefix sums and then sample $k_1$ using the CDF array.
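Putting option 1 together, a sketch in Python (here I use `math.lgamma` for the log-binomials, which is exact and avoids hand-rolling Stirling, then normalize in log space; all names are my own):

```python
import math
import numpy as np

def log_binom(a, b):
    """log C(a, b) via log-gamma, with no overflow."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def hypergeom_cdf_array(n1, N, K):
    """CDF array for k_1: draws landing in a bucket of n1 items
    when K items are drawn without replacement from N total."""
    lo = max(0, K - (N - n1))  # smallest feasible k_1
    hi = min(n1, K)            # largest feasible k_1
    logp = np.array([log_binom(n1, j) + log_binom(N - n1, K - j) - log_binom(N, K)
                     for j in range(lo, hi + 1)])
    p = np.exp(logp - logp.max())  # shift before exponentiating (log-sum-exp style)
    p /= p.sum()                   # renormalize so the PDF sums to 1
    return lo, np.cumsum(p)

def sample_k1(n1, N, K, rng=None):
    rng = rng or np.random.default_rng()
    lo, cdf = hypergeom_cdf_array(n1, N, K)
    return lo + int(np.searchsorted(cdf, rng.random(), side="right"))
```

The array has at most $n_1 + 1$ entries, so memory scales with the bucket size rather than with $N$ or $K$.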
Option 2: Approximate the CDF directly.
We can ask Mathematica for a closed form of the CDF; the result involves the generalized hypergeometric function HypergeometricPFQ.
Next you'll want to approximate that CDF. For this you'll need to approximate not only the binomial coefficients (Stirling's approximation still works) but also HypergeometricPFQ. You can look up approximation methods, but overall this route is likely more trouble than it's worth.
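As a sanity check for whichever approximation you implement, SciPy exposes the exact hypergeometric CDF for moderate sizes (assuming SciPy is available; its parameter order is population size, number of "good" items, number of draws):

```python
import math
from scipy.stats import hypergeom

# Example from above: 30 items total, 10 in bucket 1, K = 6 draws.
dist = hypergeom(30, 10, 6)   # (M=population, n=good, N=draws)
cdf_at_3 = dist.cdf(3)        # P[k_1 <= 3]

# Cross-check against the binomial-coefficient formula.
manual = sum(math.comb(10, j) * math.comb(20, 6 - j) for j in range(4)) / math.comb(30, 6)
```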
Option 3 (recommended): More drastic approximation steps
Look at this section from the Wikipedia article on hypergeometric distribution. Basically: