Sampling procedure to obtain groups with same categorical distribution

20 Views Asked by At

Imagine that we have $n$ (mutually-exclusive) sets of different sizes, $X_1, X_2, ..., X_n$. The total number of elements is $N = |X_1| + ... + |X_n|$. I want to partition these $N$ elements into groups of size $k < n$ where each group contains elements from different sets (i.e. each set occurs in each group at most once). We can assume that $N$ is divisible by $k$, so the number of groups is $g = N/k$. We can also assume that $|X_i| ≤ g$, so that it's possible to form such groups.

I'm looking for an (unbiased, efficient) sampling procedure that generates $g$ groups which are identically distributed, with that distribution being as close as possible to the categorical distribution with $p = (|X_1|/N, ..., |X_n|/N)$.

For a concrete example, for $n = 4$ and $k = 3$ with $|X_1| = 100, |X_2| = 100, |X_3| = 50, |X_4| = 50$, the $g = 100$ groups could be represented [1, 2, 3], [1, 2, 3], [2, 3, 4], [1, 2, 3], [1, 2, 4], ...

If we adopt the simple procedure of sampling without replacement, with the sets to compose each group chosen according to probability $p$ above, we will likely end up in a situation where the last few groups cannot be formed. I have tried this and it does indeed occur (with high probability for moderate-size problems).

Does an efficient such sampling procedure exist?