Say I have a categorical distribution over the finite domain $1,..., n$.
My goal is to estimate $p(1), p(2), \dots, p(n)$, but in each trial I can only draw $m$ samples without replacement.
A naive approach would be to take $m$ samples many times and then count how often each of $1, \dots, n$ appeared overall, but this obviously introduces a bias toward a more uniform distribution.
In the edge case of $m = n$, the estimate would indeed be 'fully' uniform.
How can I correct for the bias that sampling without replacement from a finite domain introduces?
Note: I cannot subsample from the $m$-many samples because that would be very inefficient.
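To make the bias concrete, here is a minimal simulation sketch (NumPy; the distribution and batch size are made up for illustration). Pooling counts over many batches of $m$ without-replacement draws pulls the estimate toward uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])  # hypothetical true distribution
n, m = len(p), 3

# Naive estimator: pool counts over many without-replacement batches.
trials = 20_000
counts = np.zeros(n)
for _ in range(trials):
    # Each batch contains m distinct categories.
    batch = rng.choice(n, size=m, replace=False, p=p)
    counts[batch] += 1
q_hat = counts / counts.sum()

print("true p        :", p)
print("naive estimate:", q_hat.round(3))  # pulled toward uniform
```

In the edge case $m = n$ every batch contains each category exactly once, so `q_hat` is exactly uniform regardless of $p$.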
This isn't a full answer, partly because it's a little unclear what it would even mean to draw multiple samples without replacement and without an order.
Can you provide any more background about this problem?
Given your comment about the $m=n$ case, by "without replacement", it appears you mean that each category can only show up once in a given sample of $m$. This is not the usual balls-and-urn setting, where multiple balls of the same color give different probabilities, and a color is only exhausted when all of its balls have been drawn.
If you did know the order, there is a reasonable interpretation: the first of the $m$ draws in a sample is distributed according to $p$, with $i$ drawn with probability $p(i)$. Subsequent draws in the same sample are then made with the already-selected categories forbidden, and the remaining categories keep probabilities proportional to the original $p(i)$. We can express this by dividing by the amount of probability "left": $1 - p(x_1) - p(x_2) - \dots$
Putting this in equations, drawing the ordered sequence $(x_1, x_2, \dots, x_m)$ happens with probability $$ \Pr(x_1, \dots, x_m) = \prod_{i=1}^{m} \frac{p(x_i)}{1 - \sum_{j=1}^{i-1} p(x_j)}, $$ where the empty sum for $i = 1$ is zero.
The probability that you draw $x_1$ is $p(x_1)$, then given that you can no longer draw $x_1$, the probability that you draw $x_2$ is $p(x_2)/(1-p(x_1))$, etc.
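As a sanity check, the product above can be computed directly; here is a small sketch (plain Python, function name is mine) that also verifies the ordered probabilities over all size-$m$ sequences sum to one:

```python
from itertools import permutations

def ordered_sample_prob(xs, p):
    """Probability of drawing the ordered sequence xs without replacement,
    renormalizing the remaining probability mass after each draw."""
    prob, remaining = 1.0, 1.0
    for x in xs:
        prob *= p[x] / remaining
        remaining -= p[x]
    return prob

p = [0.5, 0.3, 0.2]  # illustrative distribution

# The ordered samples of size m partition the probability: they sum to 1.
total = sum(ordered_sample_prob(seq, p) for seq in permutations(range(3), 2))
print(total)  # 1.0 up to floating point

# Unordered probability of a set = sum over its orderings.
pr_01 = sum(ordered_sample_prob(seq, p) for seq in permutations([0, 1]))
print(pr_01)
```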
If you don't know the order, I suppose you could sum over all the permutations of the above expression, and it would have some meaning.
These aren't quite the expected frequencies you call $q(i)$ above. Unfortunately, the easy ways of calculating them show that $q(i)$ depends on all of $p$, not just $p(i)$. We can enumerate the probability that category $i$ is selected in each position.
The chance of it being drawn first is easy: $p(i)$. The chance of it being drawn second already runs into trouble: $p(i) \cdot \sum_{j \neq i} p(j)/(1-p(j))$. If it weren't for the division, we could replace this with $p(i) \cdot (1-p(i))$ and turn the whole thing into a geometric sum, but the division is required to make sense of this sampling. The later terms only get uglier, and I see no good way to simplify.
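For small $n$ one can at least brute-force the inclusion probabilities by enumerating all ordered draws. A sketch (plain Python; I'm taking the probability that category $i$ appears anywhere in the sample, which is $m$ times the expected per-draw frequency) showing numerically that these depend on all of $p$:

```python
from itertools import permutations

def inclusion_probs(p, m):
    """q[i] = P(category i appears somewhere in a without-replacement
    sample of size m), by brute force over all ordered draws.
    Only feasible for small n."""
    n = len(p)
    q = [0.0] * n
    for seq in permutations(range(n), m):
        prob, remaining = 1.0, 1.0
        for x in seq:
            prob *= p[x] / remaining
            remaining -= p[x]
        for x in seq:
            q[x] += prob
    return q

p = [0.6, 0.25, 0.1, 0.05]  # illustrative distribution
q = inclusion_probs(p, 2)
print([round(v, 4) for v in q])  # each entry depends on all of p
print(sum(q))                    # inclusion probabilities sum to m
```

Dividing `q[i]` by $m$ gives the naive estimator's expected frequency, which visibly disagrees with $p(i)$ except in trivial cases.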