Fixed-rate sampling without replacement

Suppose that we have a population of $N$ elements $E=\{e_1,e_2,...,e_N\}$, and a corresponding set of desired sampling probabilities $P=\{p_1,p_2,...,p_N\}$. Each element $e_i\in E$ should be sampled with a probability $p_i\in P$. I want to sample $K, K<N$ elements without replacement (i.e., the same element cannot be chosen more than once) while meeting the desired sampling probabilities $P$.

As an example, suppose I have three elements $E=\{e_1,e_2,e_3\}$ that I'd like to sample with probabilities $\{p_1=0.98,\ p_2=0.01,\ p_3=0.01\}$, and I want to sample a pair ($K=2$). The possible pairs are $\{e_1,e_2\}$, $\{e_1,e_3\}$, and $\{e_2,e_3\}$. The algorithm should output a single pair each time, such that $e_1$ is sampled with probability $0.98$, while $e_2$ and $e_3$ are each sampled with probability $0.01$.

How can I do that?

There are 3 answers below.

Answer 1:

Here we use a naive algorithm to sample the elements one by one.

Let $\mathcal{K}_j$ denote the set of elements sampled without replacement after the $j$-th round, so that $$\varnothing = \mathcal{K}_0 \subset \mathcal{K}_1 \subset \ldots \subset \mathcal{K}_K$$

and $|\mathcal{K}_j|= j$. We begin with round $j = 1$.

Sort the $N$ probabilities in ascending order, and relabel them as $p(i, j)$ such that $$ p(1, 1) \leq p(2, 1) \leq \ldots \leq p(N, 1)$$ and relabel the elements $e(i, j)$ accordingly.

We can also keep track of the original index $\sigma(i, j)$ as needed, such that $$ p(i, j) = p_{\sigma(i, j)}, \quad e(i, j) = e_{\sigma(i, j)}$$

The sampling then proceeds as follows:

  1. Generate $U_j \sim \text{Uniform}(0, 1)$ independently (independent from other rounds)

  2. Set $\displaystyle S_j = 1 - \sum_{l = 1}^{j-1} p^{(l)}$ and $V_j = U_j S_j$. Here $p^{(l)}$ is the probability of the element sampled at round $l < j$; take the empty sum to be $0$ for $j = 1$. Thus $S_j$ is the total probability mass of the elements not yet sampled entering the $j$-th round.

  3. Set $m = 1$

  4. Note that there are $N - j + 1$ elements not yet sampled entering the $j$-th round. If $V_j$ satisfies $$S_j - \sum_{i=1}^m p(N-j+2-i,j) < V_j < S_j - \sum_{i=1}^{m-1} p(N-j+2-i,j)$$ then the element $e(N-j+2-m, j)$ is sampled; by "sampled" we mean setting $$ \mathcal{K}_j = \mathcal{K}_{j-1} \cup \{e(N-j+2-m, j)\}$$ Otherwise, increase $m$ by $1$ and repeat step $4$ until an element is sampled.

  5. After successfully sampling an element in the $j$-th round, record $p^{(j)} = p(N - j + 2 - m, j)$ as the probability of the sampled element, which is then removed.

For the remaining probabilities, set $$ p(i, j+1) = \begin{cases} p(i, j) & \text{when} & 1 \leq i < N - j + 2 - m \\ p(i+1, j) & \text{when} & N - j + 2 - m \leq i \leq N - j \end{cases}$$ Essentially, this keeps the same index $i$ in the next round for the probabilities not reached in the loop of step $4$, and shifts down by $1$ those that were. Relabel $e(i, j+1)$ and $\sigma(i, j+1)$ in the same way as needed.

  6. Increase $j$ by $1$ and repeat from step $1$; stop once round $j = K$ has completed, which indicates that all $K$ elements have been sampled.

At the end, $\mathcal{K}_K$ contains all the sampled elements.
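The rounds above can be sketched in Python (a minimal sketch; the function name `sequential_sample` is mine, and the sort is omitted since a linear scan over the remaining weights draws the same categorical distribution regardless of order):

```python
import random

def sequential_sample(elements, probs, k):
    """Sample k distinct elements one by one: in each round, draw an
    element with probability proportional to its weight among those
    not yet chosen, then remove it. Assumes k <= len(elements)."""
    remaining = list(zip(elements, probs))
    chosen = []
    for _ in range(k):
        s_j = sum(p for _, p in remaining)   # S_j: mass not yet sampled
        v_j = random.random() * s_j          # V_j = U_j * S_j
        cum = 0.0
        for idx, (e, p) in enumerate(remaining):
            cum += p
            if v_j < cum:
                chosen.append(e)
                del remaining[idx]
                break
        else:
            # Floating-point round-off: fall back to the last element
            chosen.append(remaining.pop()[0])
    return chosen
```

For example, `sequential_sample(['e1', 'e2', 'e3'], [0.98, 0.01, 0.01], 2)` returns a pair of distinct elements each call.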

The algorithm is very naive and simple, but hard to present without proper notation. See if this fulfills the "Fixed-rate sampling" requirement.

Answer 2:

Your requirements are not always feasible.

For $K=1$ it is always possible, but for $K>1$ you will run into inconsistencies.

Take your example $\{.98, .01, .01\}$.

If $K=3$ there is only one possible sample (all three elements), so we cannot get the desired marginal probabilities.

Let $p_{i,j}:=P\left(\{e_i,e_j\}\right)$

For $K=2$ we have

$.98 = p_{1,2} + p_{1,3}$

$.01 = p_{1,2} + p_{2,3}$

$.01 = p_{1,3} + p_{2,3}$

This can be cast as a matrix algebra problem:

$$\mathbf{A}\mathbf{p_{ij}} = \mathbf{p_i}$$

Where

$$\mathbf{p_{ij}} = (p_{1,2}, p_{1,3}, p_{2,3})^\top$$ $$A = \left[ \begin{matrix}1&1&0\\1&0&1\\0&1&1\end{matrix}\right]$$ $$\mathbf{p_i} = (.98,.01,.01)^\top$$

Then we get $$A^{-1} = \left[ \begin{matrix}.5&.5&-.5\\.5&-.5&.5\\-.5&.5&.5\end{matrix}\right]$$

This implies

$$\mathbf{p_{ij}} = \left[ \begin{matrix}.5&.5&-.5\\.5&-.5&.5\\-.5&.5&.5\end{matrix}\right] \left[\begin{matrix}.98\\.01\\.01\end{matrix}\right] = \left[\begin{matrix}.49\\.49\\-.48\end{matrix}\right] $$

But this requires a negative probability! Therefore, there does not exist a set of pair probabilities that meets your example criteria for $K=2$.
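This inversion can be checked in a few lines of Python (a sketch; the helper name `pair_probs` is mine, and it simply applies the rows of $A^{-1}$ above to the marginals):

```python
def pair_probs(p1, p2, p3):
    """Apply A^{-1} to the element marginals: each pair probability is
    a signed half-sum of the marginals."""
    p12 = 0.5 * (p1 + p2 - p3)
    p13 = 0.5 * (p1 - p2 + p3)
    p23 = 0.5 * (-p1 + p2 + p3)
    return p12, p13, p23

p12, p13, p23 = pair_probs(0.98, 0.01, 0.01)
print(p12, p13, p23)  # p23 comes out negative: no valid distribution
```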

Answer 3:

Your "target sample probabilities" $P$ need to add up to $K$, not $1$. To see this in your $N=3$, $K=2$ example, let $p_{ij}$ be the probability of getting the (unordered) sample $\{e_i,e_j\}$. Then we have the following:

  • $p_{12}+p_{13}=p_1$
  • $p_{12}+p_{23}=p_2$
  • $p_{13}+p_{23}=p_3$
  • $p_{12}+p_{13}+p_{23}=1$

Add up the first three equations, then use the last equation to show $2=p_1+p_2+p_3$.

Letting the target probabilities sum to $2$ and following User5678's answer, you get the following solution:

  • $p_{12}=1-p_3$
  • $p_{13}=1-p_2$
  • $p_{23}=1-p_1$
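A quick numerical check of this closed form (a sketch; the helper name `pair_distribution` is mine, and it assumes the targets sum to $K=2$ with each $p_i \leq 1$):

```python
def pair_distribution(p1, p2, p3):
    """Closed-form pair probabilities when p1 + p2 + p3 = 2."""
    return {'12': 1 - p3, '13': 1 - p2, '23': 1 - p1}

# Example targets summing to K = 2
p1, p2, p3 = 0.9, 0.6, 0.5
d = pair_distribution(p1, p2, p3)
assert abs(sum(d.values()) - 1) < 1e-12     # a valid distribution
assert abs(d['12'] + d['13'] - p1) < 1e-12  # marginal of e1
assert abs(d['12'] + d['23'] - p2) < 1e-12  # marginal of e2
assert abs(d['13'] + d['23'] - p3) < 1e-12  # marginal of e3
```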

Unfortunately, I don't have a general solution for all $N$ and $K$.