Simple random sample without replacement

1.3k Views Asked by At

I have a data file from which I wish to create a uniformly distributed simple random sample, without replacement. Will the following algorithm give me an unbiased result?

1 Set T = total number of records in the file.
2 Set S = number of samples required.
3 For each record in the file, in order:
  i    Set X = a random uniformly distributed number between 0 and 1.
  ii.  If X < S/T, select the record and decrement S.
  iii. Decrement T.
1

There are 1 best solutions below

0
On BEST ANSWER

Yes, the algorithm will give an unbiased random sample.

For clarity, I'll use $S$ and $T$ to denote the (constant) size of the sample and population respectively, and $s_i$, $t_i$ to denote the values of the variables after the $i^{th}$ iteration of the loop. $x_i$ will denote the $i^{th}$ realization of $X$.

First, let's nail down what it means to be an unbiased random sample. It's not sufficient to require that each element have the same probability of being included; every possible sample of size $S$ must have equal probability.

First claim: the algorithm will always output $S$ samples

We know that $S \leq T$ initially because the sample size cannot exceed the population. In fact, it's always true that $s_i \leq t_i$ because, going into the loop, either $s_i < t_i$ (in which case $s_i$ is still $\leq t_i$ after $t_i$ is decremented), or $s_i = t_i$ (in which case $x_i < s_i \mathbin{/} t_i = 1$, so $s_{i+1} = s_i - 1 = t_{i+1}$). So by induction, $s \leq t$ is an invariant of the loop.

Also, it's always true that $s_i \geq 0$, since if $s_i = 0$, $x_i > s_i \mathbin{/} t_i = 0$ and $s$ will not be decremented below $0$.

At the end of execution, since $t_T = 0$, $s_T = 0$ too. Since $s$ is decremented every time a record is chosen for the sample, the number of samples chosen will always be $S$, as claimed.

Second claim: every possible sample of $S$ elements is equally likely to be output by the algorithm

Let $U$ be an arbitrary sample of $S$ elements. We can calculate the probability of obtaining $U$ by taking the product of the probabilities of the random choices in the code at 3.ii necessary to produce the sample. For each record, this probability is $s_i \mathbin{/} t_i$ if the record is in the sample, or $(t_i - s_i) \mathbin{/} t_i$ if it is not. Let's simplify that to $g_U(i) \mathbin{/} t_i$ where $g_U(i)$ is defined as:

$g_U(i) = \begin{cases} s_i & \text{if the }i_{th}\text{ record is in }U\\ t_i - s_i & \text{if the }i_{th}\text{ record is not in }U \end{cases} $

Then the probability of the algorithm returning the set of samples $U$ can be written as

$$P(U) = \prod_{i=0}^{T} \frac{g_U(i)}{t_i}$$

Note that $t$ is decremented for every iteration, so

$$P(U) = \prod_{i=0}^{T} \frac{g_U(i)}{T-i} = \frac{1}{T!} \prod_{i=0}^{T} g_U(i)$$

The final, key insight is that while the function $g_U(i)$ depends on the sample $U$, the values of $g_U(i)$ are always permutations of the same list.

In the $S$ cases where the $i^{th}$ record belongs to the sample $U$, $g_U(i) = s_i$. These values are $\{S, S - 1, \ldots, 1\}$, so their product is $S!$

Likewise, in cases where the $i^{th}$ record does not belong to $U$, $g_U(i) = (t_i - s_i)$. Notice that $t - s$ decreases only when records do not belong to $U$. So the values are $\{T - S, T - S - 1, \ldots, 1\}$, and their product is $(T - S)!$.

$$\frac{1}{T!} \prod_{i=0}^T g_U(i) = \frac{S! (T - S)!}{T!} = \binom{T}{S}^{-1}$$

Therefore each of the $\binom{T}{S}$ possible $S$-length samples of $T$ records have an equal probability of occurring, as desired.