Rendering Bernoulli sampling by the binomial distribution

I want to characterize the simple algorithm for selecting, say, 20% of a discrete set when the total size of the set is not known. For each item:

if (random mod 100) < 20
   select item
end
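For concreteness, the per-item rule can be sketched in Python as follows (the function name `select_items` and the seeded generator are my additions, not part of the question):

```python
import random

def select_items(items, percent=20, rng=random):
    """Keep each item independently with probability percent/100,
    mirroring the pseudocode: draw a uniform integer in [0, 100)
    and keep the item when it is below `percent`."""
    return [item for item in items if rng.randrange(100) < percent]

# The algorithm never needs the total size of the set: each item is
# examined once and kept with probability 0.2, independently.
sample = select_items(range(10_000), rng=random.Random(2021))
print(len(sample))  # close to 2000, the expected count
```

Note that `randrange(100)` is uniform by construction; with a raw `random mod 100` the result is only uniform if the generator's range is a multiple of 100.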

Does that algorithm have a name? Is it discussed somewhere? The only hint I found is that the probability of ending up with $k$ items after examining $n$ of them can be calculated using the binomial distribution as: $$f(n, k) = \binom{n}{k}p^k(1-p)^{n-k}$$ where $p=\frac{20}{100} = 0.2\quad$ in this example. I made two attempts to visually render how $f$ varies with $n$. As I'm not well-prepared in statistics, I fear they are rather outlandish. How can I improve them?

For a first attempt, I can define $k$ as acceptable if $k/n$, rounded to one decimal place, equals $p$. With this definition one can compute: $$ k_{low} = \lfloor (p - 0.05 ) \cdot n \rfloor, $$ $$ k_{high} = \lceil (p + 0.05) \cdot n \rceil - 1 \text{, and}$$ $$ \text{probability that } k \text{ is acceptable} = \sum_{k=k_{low}}^{k_{high}} \binom{n}{k}p^k(1-p)^{n-k}$$

That way one can compile a table like the following, which may serve as a characterization:

     n   low k   high k   probability that k is acceptable
    10       1        2   0.570425
   100      15       24   0.788203
  1000     150      249   0.999913
 10000    1500     2499   1.000000
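The table can be reproduced with a short Python sketch (the helper names are my choice; the pmf is evaluated in log space so that the huge binomial coefficients for $n = 10000$ do not overflow a float):

```python
from math import ceil, exp, floor, lgamma, log

def binom_pmf(n, k, p):
    """Binomial pmf computed via log-gamma to stay in floating range
    even when the binomial coefficient itself is astronomically large."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def acceptance_probability(n, p=0.2, half_width=0.05):
    """Return (k_low, k_high, P(k_low <= k <= k_high)) for k ~ Binomial(n, p),
    with the bounds defined as in the text."""
    k_low = floor((p - half_width) * n)
    k_high = ceil((p + half_width) * n) - 1
    prob = sum(binom_pmf(n, k, p) for k in range(k_low, k_high + 1))
    return k_low, k_high, prob

for n in (10, 100, 1000, 10000):
    k_low, k_high, prob = acceptance_probability(n)
    print(f"{n:6d} {k_low:5d} {k_high:5d} {prob:.6f}")
```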

For a second attempt, I plotted the values of $f(n,k)$, scaling the interval $[0, n]$ in abscissa to the unit interval $[0, 1]$, and multiplying $f(n, k)$ by $n/10$ in ordinate, so as to compensate for shrinking the abscissa. That way, the "integral" of the curve appears to be preserved. For high values of $n$ I plot the average of a few neighbouring $k$'s. Since the peak values grow with $n$, I use a logarithmic scale.

[Figure: binomial distribution for p = 0.2]
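The "integral is preserved" observation can be checked without any plotting library: with abscissa spacing $1/n$ and ordinate $(n/10) f(n,k)$, the Riemann sum of the scaled curve equals $(1/10)\sum_k f(n,k) = 0.1$ for every $n$. A stdlib-only sketch (the function name is mine):

```python
from math import comb

def scaled_curve(n, p=0.2):
    """Points (k/n, (n/10) * f(n, k)): abscissa squeezed to [0, 1],
    ordinate stretched by n/10 to compensate."""
    return [(k / n, (n / 10) * comb(n, k) * p**k * (1 - p)**(n - k))
            for k in range(n + 1)]

for n in (100, 1000):
    # Riemann sum with spacing 1/n: equals (1/10) * (sum of the pmf),
    # i.e. 0.1 for every n -- the scaled curves enclose the same area.
    area = sum(y for _, y in scaled_curve(n)) / n
    print(n, round(area, 6))
```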

I'd like to expand a wikipedia article with such a characterization. However, the explanations above smell of original research. Are there references that justify them?

There are 2 best solutions below


Thank you for showing what you have tried. However, honestly, I have no clear idea from your explanation and example what you are trying to do, so my speculation below as to one possibly useful path may be more comment than answer.

If this is approximately what you're trying to do, maybe you can build on it. If not, maybe it will help you describe what you do want.

One interpretation is that you are trying to sample $n$ observations from a Bernoulli population with $P(X = 0) = 1-p, P(X = 1) = p,$ with $p = 0.2.$

In R statistical software, this can be done for $n = 1000$ as follows:

set.seed(2021)           # for reproducibility
x = rbinom(1000, 1, .2)  # 1000 Bernoulli draws with p = 0.2

Among the 1000 elements of the resulting vector x, there happen to be $192$ 1s and $808$ 0s; that is about 20%.

table(x)
    x
  0   1 
808 192 

If you regard these 1000 observations as data, you can use them to make a 90% confidence interval for the proportion $p$ of 1s. There are many styles of confidence intervals. According to the Wikipedia article on binomial proportion confidence intervals, the Jeffreys CI has good properties (actually 'covering' the true $p$ the 'promised' 90% of the time). Using R, it is also very easy to compute Jeffreys CIs. For my sample simulation run, the 90% CI is $(0.172, 0.213),$ which does happen to include $p = 0.20.$

CI.90 = qbeta(c(.05,.95), 192.5, 808.5); CI.90  # Beta(x+0.5, n-x+0.5) quantiles
[1] 0.1721825 0.2131322
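For readers without R, the Beta(192.5, 808.5) quantiles used above can be approximated with only the Python standard library. This is a normal approximation to the posterior, a sketch rather than the exact `qbeta` computation, and the function name is my own:

```python
from math import sqrt
from statistics import NormalDist

def jeffreys_ci_approx(successes, n, conf=0.90):
    """Normal approximation to the Jeffreys interval: mean and standard
    deviation of the Beta(successes + 0.5, n - successes + 0.5) posterior,
    plus/minus the matching normal quantile."""
    a = successes + 0.5
    b = n - successes + 0.5
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    z = NormalDist().inv_cdf((1 + conf) / 2)
    return mean - z * sd, mean + z * sd

lo, hi = jeffreys_ci_approx(192, 1000)
print(round(lo, 4), round(hi, 4))  # within about 0.001 of R's qbeta values
```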

Notes: (1) Although the Jeffreys interval has excellent frequentist properties, it is based on a Bayesian argument using the Jeffreys non-informative prior distribution $\mathsf{Beta}(.5,.5).$

(2) Everything I have shown here involves well-known and often-used ideas of probability and statistics. So the phrase 'original research' does not apply.


Perhaps the definition of an acceptable count $k$ given in the question is rather unusual. Looking at the practice of confidence intervals, it seems more usual to fix an error bound first, and then determine the minimum value of $n$ such that samples selected by the algorithm have counts within the error bound in 95% of the cases.

For given values of $n$ and $p$, the counts that fall within the bound form the set $$ K_{n, p} = \left\{ k \in \Bbb N: \left\vert \frac kn - p \right\vert < \text{error} \right\}$$

By the binomial distribution, the probability that the count falls in that set is: $$\sum_{k \in K_{n,p}} \binom{n}{k}p^k(1-p)^{n-k}$$

One can use bisection to find the smallest integer $n$ that makes that sum at least $0.95$. That way one can get the picture below:

[Figure: confidence of Bernoulli sampling]

For $p = 0$ and $p = 1$ the algorithm delivers exact results for all $n$'s; the $p$'s in between are shown in the figure. Note that $error = 0.005$ is an order of magnitude smaller than the bound used in the question: if $100p$ is an integer percentage, $|k - np| < 0.005 \cdot n$ guarantees that $k/n$, rounded to an integer percentage, equals $100p$.
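The minimum-$n$ search described above can be sketched in Python with exact binomial sums. Here I use the question's coarser bound $error = 0.05$ so the run stays short, and a linear scan rather than bisection, since the coverage probability is not strictly monotone in $n$; the helper names are my own:

```python
from math import comb

def coverage(n, p, error):
    """P(|k/n - p| < error) for k ~ Binomial(n, p), i.e. the probability
    that the selected count k lies in the set K defined above."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) < error)

def min_n(p=0.2, error=0.05, conf=0.95, n_max=10_000):
    """Smallest n whose coverage reaches conf (linear scan)."""
    for n in range(1, n_max + 1):
        if coverage(n, p, error) >= conf:
            return n
    raise ValueError("confidence level not reached below n_max")

n = min_n()
print(n, coverage(n, 0.2, 0.05))
```

With the figure's tighter bound $error = 0.005$ the same search works but needs $n$ in the tens of thousands, where a log-space pmf is advisable.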