Sample a specific number of elements from a list with matching probabilities

100 Views Asked by At

Say I have a list of $n$ elements, $x=[x_1,\dots,x_n]$, and a list of $n$ probabilities $p=[p_i]$. I want to sample $c$ elements of the list without replacement, where $1<c<n$, and I want the probability that my sample contains $x_i$ to be $p_i$. This in turn means $\sum_i p_i=c$. Note that the $p_i$ are indeed probabilities, not weights: if $p_7=1$ then every sample must contain $x_7$.

For $n=3$, $c=2$ this problem is over-constrained and has a solution (let $p_{ij}$ be the probability of the specific sample $\{x_i,x_j\}$, then solve $\sum_{i,j}p_{ij}=1$, $p_{12}+p_{13}=p_1$, etc. Alternatively, notice that "pick a sample of two" is the same as "pick a sample of one and then choose the other two" and hence $p_{12}=1-p_3$, etc). This generalizes to $c=n-1$.

For $n=4$, $c=2$, the above system is under-constrained: a valid solution could have e.g. $p_i>0$ but also $p_{24}=p_{34}=0$.

My questions are:

  1. Is there an existing algorithm for this sort of sampling, which has nice properties and avoids ruling out specific samples unnecessarily?
  2. If not, are there any extra constraints I could put on the $n>c+1$ cases to force a unique solution?
    • Ideally this would be a constraint which pushes $p_{ij}$ to be closer to uniform or independent, however $p_{ij}=p_i p_j$ is too strong. Are there other obvious choices?
1

There are 1 best solutions below

8
On

Edit: This does not always work, because sometimes it produces negative probabilities.

Here is a solution which seems pretty nice. For any $S\subseteq \{1,\dots,n\}$, where $|S|=c$, the probability of choosing $S$ is as follows: $$ \mathrm P(S) =\frac1{(n-c)\binom{n-1}{c-1}} \left({(n-c)\sum_{i\in S}p_i-(c-1)\sum_{j\notin S}p_j}\right) $$ For example, the $n=4,c=3$, so that $n-c=\color{blue}{1}$ and $c-1=\color{red}2$, then $$ \mathrm P(1,2,4)=\frac{\color{blue}1\cdot p_1+\color{blue}1 \cdot p_2-\color{Red}2\cdot p_3+\color{blue}1\cdot p_4}{3} $$ When $n=5,c=2$, $$ \mathrm P(2,5)=\frac{-p_1+3p_2-p_3-p_4+3p_5}{12} $$ When $n=5,c=3$, $$ \mathrm P(1,3,5)=\frac{2p_1-2p_2+2p_3-2p_4+2p_5}{12} $$