Uniformly Random Tuples

857 Views Asked by At

Consider a multiset of natural numbers. As an example take

$$ M = \{1, 2, 2, 3, 3, 3\} $$

If we treat copies of the same number as indistinguishable, there are 8 distinct 2-tuples we can form from this, by not using any number more often than it appears in $M$:

$$ \{ (1,2), (2,1), (1,3), (3,1), (2,3), (3,2), (2,2), (3,3) \} $$

Is there an algorithm which generates one of these tuples with uniform probability, without having to generate all the tuples up front and selecting one at random? Note that the uniformity is in reference to the list of tuples, so $(3,3)$ should appear with the same probability as $(2,2)$ or $(1,2)$, respectively.

The algorithm should work for arbitrary non-empty $M$ and $n \leq |M|$, where the goal is to generate $n$-tuples.

Ideally, I'm looking for an algorithm that is not based on rejection, but if that is not possible, I'd also be interested in how I can minimise the number of necessary rejections to generate the tuples efficiently.

If special cases (i.e. small, fixed $n$) yield good algorithms, I'd still be interested in those, even if they don't generalise to larger $n$.

3

There are 3 best solutions below

0
On

Here, i think, is one way:

  1. Construct the probability generating function for ordered $n$-tuples of $M'$ (the set of distinct elements of $M$ mentioned by @AndreiRykhalski) with replacement (i.e., without constraints on the number of appearances of any element). The coefficients are the $n^\text{th}$ list of $|M|$-nomial coefficients, but the terms themselves are needed for the next step.
  2. Remove any impossible terms, i.e. with variables raised to powers larger than their counts in $M$. Concession: This might be prohibitively awkward. Depending on implementation, it might be done within step 1, so that you don't end up calculating huge generating functions. (Sorry, this involves rejection.)
  3. Sample from the remaining terms, weighted by their coefficients.
  4. Sample uniformly from the multiset permutations encoded by the selected term.

In your example, $M$ contains 3 distinct elements, so the generating function for ordered tuples of these elements has trinomial coefficients (step 1): $$(x+y+z)^2=x^2+2xy+2xz+y^2+2yz+z^2$$ The term $x^2$ corresponds to the impossible pair $(1,1)$, so delete it (step 2). The remaining pairs have total weight 8, so sample an integer between 1 and 8 (step 3). I get 8, so i pick the term $z^2$, which corresponds to $(3,3)$. Had i instead gotten 3, i would pick the term $2xz$ and have to choose between $(1,3)$ and $(3,1)$ (step 4).

I'm not a seasoned programmer, so i can't speak to the scalability of this solution.

4
On

Let the elements of the $k-$sized alphabet be indexed by $i=0,1 \cdots k-1$. Let $n$ be the tuple length, and let the vector ${\bf a}=(a_0,a_1,\cdots , a_{k-1})$ ($a_i\ge 0$) denote the available number of elements.

In the example: $n=2$, $k=3$, ${\bf a}=(1,2,3)$.

Let $N({\bf a},n)$ count the total number of tuples subject to the restriction.

Then an algorithm could be: for each $t=0 \cdots n-1$, for each $i=0 \cdots k-1$ compute $w_i = N({\bf a}^{[i]},n-t-1)$ where ${\bf a}^{[i]}$ is equal to ${\bf a}$ except for the $i-$th element, which is decreased by one. Then compute a random integer $s$ in $[0 \cdots k-1]$ with probabilities proportional to the weights $w_i$, assign this to the tuple $X_t:=s$ and update ${\bf a}:= {\bf a}^{[s]}$

To compute $N({\bf a},n)$ we can use the recursion:

$$N({\bf a},n)=\sum_{i=0 }^{k-1 }N({\bf a}^{[i]},n-1) $$

with $N({\bf a},0)=1$, and $N({\bf a},n)=0$ if ${\bf a}_i <0$ for some $i$.

If we care about performance, and if we plan to generate many tuples, we should precompute or cache the results of the above.

Sample Java code (non optimized) at Ideone

0
On

This is a variant of leonbloy's answer with the same basic idea: compute the number of tuples that begin with a given element, and use those numbers are weights in a random choice. If $n$ is small compared to $|M|$, this method is asymptotically much faster. However, it is not any faster than simply generating all possible tuples and choosing uniformly from them.

Let $C_i$ be the number of distinct elements that have multiplicity $i$ in $M$, and denote $c_i = |C_i|$. We may assume that $c_i = 0$ for $i > n$. Denote by $N^n_i(c_1, c_2, \ldots, c_n)$ the number of distinct $n$-tuples drawn from $M$ whose first entry is an element of $C_i$ (this number only depends on $n$, $i$, and the $c_j$). Then we have the recursion formula

$$ N^n_i(c_1, c_2, \ldots, c_n) = \left\{ \begin{array}{ll} 0, & \mbox{if } c_i = 0 \\ 1, & \mbox{if } n = 0 \\ c_i \sum_{j=1}^n N^{n-1}_j (d_1, \ldots, d_n) & \mbox{otherwise} \end{array} \right. $$

where $d_{i-1} = c_{i-1} + 1$ (if $i>1$), $d_i = c_i - 1$, and $d_j = c_j$ for all other $j$. The first two cases are self-explanatory, and the final case holds because we choose one element of $C_i$, which gives the multiplier $c_i$, and for the second slot of the tuple, we choose an element from one of the $j$ updated multiplicity classes, where $C_i$ has lost one element and $C_{i-1}$ gained one.

After the computation, you use $c_i^{-1} N^n_i(c_1, c_2, \ldots, c_n)$ as the weight for the elements of $C_i$, randomly choose an element of $M$, and repeat for the next slot.

As @leonbloy suggests, the tuples should be cached for performance. The number of tuples to keep in memory, and the time complexity of the algorithm, is at most $\sum_{\ell = 1}^n (\max \{ c_i : i = 1, \ldots, n \})^\ell$, which is $O(|M|^n)$ if $n$ is fixed. In @leonbloy's algorithm, this number seems to be of the order $n^k$ where $k$ is the number of distinct elements in $M$. If $n$ is close to $|M|$ and/or $k$ is small compared to $|M|$, @leonbloy's solution is probably faster. Note that $n \leq |M|$ holds in all nontrivial cases.