How to generate balanced combinations?

477 Views Asked by At

Let's say I have $T$ objects I want to put into combinations of size $N$ (lets say 4).

So, I start iterating through them (in dictionary order):

A, B, C, D
A, B, C, E
A, B, C, F
...

This has one big problem: There are a lot of possible combinations. I don't have time for that. So, I select only the first $Y$ combinations.

However, this has another problem: Object $A$ is going to appear far more often than any other object. I want to ensure that all items appear as evenly as possible.

So fo trial number 2, I simply take the next N items:

A, B, C, D
E, F, G, H
I, J, K, L
...
A, B, C, E (we skip to E this time around)
F, G, H, I
J, K, L, M

This is looking much better. However, there's another problem: The pairing $(A, B)$ will appear far more often than the pairing $(A, J)$

Therefore, what I really want is an ordering on the combinations where taking the first $Y$ combinations will produce a balanced number of subcombinations.

  • A subcombination is a strict subset of a combination
  • Balanced means that the difference between the number of identical subcombinations must be as small as possible. (e.g. The number of $(A, J)$ pairs must be as close as possible to the number of $(A, B)$ pairs)

Is this possible? Is there an easy algorithm to generate this ordering?

1

There are 1 best solutions below

7
On BEST ANSWER

Low Discrepancy Sequences

Yes, it is possible to sample the combinations in a manner that is very evenly balanced.

Point distributions that are more even than 'uniform random sampling' are called low discrepancy sequences (also called quasirandom sequences). See wikipedia article "low discrepancy sequences" for a brief overview on these sequences. There are numerous ways to construct low discrepancy sequences, including Van der Corput/Halton, Kronecker/Weyl, Niederreiter, and Sobol to name a few. Furthermore, all of these can be used to create $d$-dimensional sequences.

The figure below shows a comparison of various low discrepancy sequences in two-dimensions. The important feature to note is how all of the distributions are more even than the random one, and have less 'gaps' and less 'clusters'.

enter image description here

The method below shows how we can exploit this evenness to select a sequences of subsets that are well balanced.

Low Discrepancy Combinatorics

However, for this particular question, we are not interested in point distributions in $n$-dimensional space, but rather subsets of $k$ items, each selected out of a population of $n$ objects. That is,

Given a set of $n$ objects, $\{S: x_i$ for $1 \leq i \leq n \} $ and an integer $k, \; (1\leq k \leq n)$,

We show how to create an open sequence $\{s_0, s_1,s_2, s_3, \ldots \} $ of subsets, each of size $k$, such that for all values of $m>1$ and $j \; (1 \leq j \leq k)$, the amongst first $m$ terms of the sequence, all possible $j$-sets are distributed as evenly as possible.

That is, we show how to sample in a manner that at all times throughout the experiment, we maintain:

  • similar frequencies of all possible $\{x_i\}$,
  • similar frequencies of all possible 2-sets of $\{x_i,x_j\}$;
  • ...
  • similar frequencies of all possible $k$-sets.

Although the method below works with the use of any low discrepancy sequences it is only optimal for the Sobol and Niederreiter sequences due to their intrinsic relationship with binary sequences, and especially to a special characteristic that Sobol termed, "Property A".

Thus, the method can be formally described as follows

Methodology for constructing an even subset selection

  1. Let i=1.
  2. Calculate the $i$-th term in the $n$-dimensional Sobol sequence
  3. Convert this to a binary vector, $b_i$.
  4. If $b_i$ has exactly $k$ ones in it, then this represents a subset that has exactly $k$ items, and so let this be the next subset; otherwise reject this $b_i$.
  5. $i \leftarrow i+1$.
  6. Goto step 2, until number of desired subsets is achieved.

In many cases, this method requires rejecting a modest fraction of subsets. However, as most computers can still typically scan through several million terms in under a second, an acceptance rate of only a few percent, is most likely not an issue for most people. In practice, this means that this method works well for situations where $d \lesssim 20$.

An example

To illustrate, we describe the how to select a few hundred samples, with each sample having exactly $k=4$ objects out of a possible $N=9$ objects. It should be clear from how this extends to arbitrary $n$ and $k$.

First we calculate the $n=9$-dimensional Sobol sequence. This is usually available via most statistical software libraries. Typically each software implementation varies slightly as there are various choices for key parameters that need to be made. Thus if your are trying to replicate this method, the values in the sequence that you construct may differ slightly, but the final results will still hold.

Using Mathematica, the first few terms of the $n=9$-dimensional Sobol sequence are:

$$ t_0 = (0.380, 0.462, 0.0743, 0.568, 0.933, 0.688, 0.304, 0.935, 0.867)$$ $$ t_1 = (0.880, 0.962, 0.574, 0.0679, 0.433, 0.188, 0.804, 0.435, 0.617)$$ $$ t_2 = (0.130, 0.712, 0.324, 0.318, 0.683, 0.438, 0.0537, 0.685, 0.117)$$ $$ t_3 = (0.630, 0.212, 0.824, 0.818, 0.183, 0.938, 0.554, 0.185, 0.0854)$$ $$ t_4 = (0.161, 0.368, 0.918, 0.662, 0.402, 0.0313, 0.772, 0.216, 0.585)$$ $$ t_5 = (0.661, 0.868, 0.418, 0.162, 0.902, 0.531, 0.272, 0.716, 0.835)$$ $$ t_6 = (0.411, 0.618, 0.668, 0.412, 0.152, 0.781, 0.522, 0.466, 0.335)$$ $$ t_7 = (0.911, 0.118, 0.168, 0.912, 0.652, 0.281, 0.0225, 0.966, 0.460)$$ $$ t_8 = (0.286, 0.993, 0.793, 0.287, 0.527, 0.156, 0.397, 0.0914, 0.960)$$ $$ t_9 = (0.786, 0.493, 0.293, 0.787, 0.0272, 0.656, 0.897, 0.591, 0.710)$$ $$ t_{10} = (0.036,0.243,0.543,0.537,0.777,0.906,0.147,0.341,0.210)$$ $$ \ldots $$

Now convert each of the components in each term, into a binary $0/1$ value via the Round[.] function. This, results in the following sequence $$ b_0 = (0,0,0,1,1,1,0,1,1)$$ $$ b_1 = (1,1,1,0,0,0,1,0,1)$$ $$ b_2 = (0,1,0,0,1,0,0,1,0)$$ $$ b_3 = (1,0,1,1,0,1,1,0,0)$$ $$ b_4 = (0,0,1,1,0,0,1,0,1)$$ $$ b_5 = (1,1,0,0,1,1,0,1,1)$$ $$ b_6 = (0,1,1,0,0,1,1,0,0)$$ $$ b_7 = (1,0,0,1,1,0,0,1,0)$$ $$ b_8 = (0,1,1,0,1,0,0,0,1)$$ $$ b_9 = (1,0,0,1,0,1,1,1,1)$$ $$ b_{10} = (0,0,1,1,1,1,0,0,0)$$ $$ \ldots$$

The key part of the method now is to understand that each of the binary vectors defines a subset. For example the first term $ b_0 = (0,0,0,1,1,1,0,1,1)$, corresponds to selecting only the 4th, 5th, 6th, 8th and 9th (reading from left to right).

That is, the subset {DEFHI} if the objects were the letters of the alphabet. However, this subset has five items not 4, so reject this sample. Similarly for $b_1,b_2,b_3$ and $b_4$.

However, $ b_4 = (0,0,1,1,0,0,1,0,1)$ corresponds to $\{CDGI\}$ which has the exactly the prescribed number of items, so this accept this as the first sample.

Thus the first few (accepted) terms are: $\{b_4,b_6,b_7,b_8,...\} $ $$= \{ CDGI, BCFG, ADEH, BCEI, CDEF, ABGH, DFGI, BEFI, BDHI, EGHI,... \}$$

If we follow this method for our particular case of $n=9,k=4$, the first $m=2000$ terms of the Sobol sequence will result in 489 acceptable subsets of length $k=4$.

Empirical Results

We can verify that for our example, this produces evenly combinations by listing the frequency counts for various tuples.

The sorted frequency of each of the 11 different $x_i$ is, $$ 218, 218, 218, 218, 218, 219, 219, 219, 221$$

The sorted frequency of each of the 36 different pairs $\{x_i,x_j\}$ is, $$ 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 83, 83, 83, 83, 83, 83, 83, 83, 84$$

The sorted frequency of each of the 84 different 3-sets $\{x_i,x_j,x_k\}$ $$ 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, \ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, \ 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, \ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, \ 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24$$

And finally, the sorted frequency of each of the 126 different 4-sets is: $$3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 $$

Special values of $m$

The beautiful characteristic of Sobol sequences (more precisely Property A sequences) is that for all multiples of $2^d = 512$ terms, the above subset frequency counts will be exactly even, as each $j$-tuple will occur exactly $\binom{n-j}{k-j}$ times.

That is, if we had the freedom to select a fixed $m$ in advance then choosing $m=512$, (which would result in $\binom{n}{k}$ accepted terms) would then be the obvious ideal choice.

However, the main contribution of this post, is to show that this method produces excellent results for arbitrary values of $m$. Said another way, this method describes how to construct an open-ended sequence that produces very even $j$-tuple frequency distributions for all $(1\leq k\leq n$), for all values of $m=1,2,3,.....$ .

Comparison with Uniform Random Sampling

Obviously these frequency counts are very even. However, to show how much this is an improvement over simple uniform random sample, the follow frequency counts are given for a corresponding sample of 490 randomly sampled 4-sets.

The frequency of each of the 9 different $x_i$ is, $$196, 196, 200, 208, 216, 220, 224, 225, 235$$

The frequency of each of the 36 different pairs $\{x_i,x_j\}$ is, $$ 66, 67, 70, 70, 70, 71, 72, 72, 72, 73, 73, 74, 74, 75, 75, 76, 77, \ 78, 78, 79, 80, 82, 84, 85, 85, 86, 86, 88, 89, 90, 90, 91, 92, 93, \ 96, 101$$

The frequency of each of 84 the different sets $\{x_i,x_j,x_k\}$ $$ 14, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, \ 19, 19, 19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, \ 21, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, \ 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, \ 27, 27, 28, 28, 28, 28, 30, 30, 30, 30, 30, 31, 32, 33, 34, 36$$

And finally, the frequency of each of the 126 different 4-sets is: $$0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, \ 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, \ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, \ 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, \ 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, \ 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 10 $$

Summary

The low discrepancy Sobol sequence can be used to construct a sequence of subsets such that the cumulative frequency of all possible subsets of variables is evenly distributed.