Sample size that contains the ratio of the population


Suppose we have a recorded percentage (a statistic) for a population. If we take random samples, we might not observe that percentage until we have drawn a very large sample, possibly even exhausting the population.
E.g. a box with $500000$ balls, of which $25000$ ($5$%) are red and the rest are white.
My question is: what is the minimum/maximum sample size needed so that we can expect to see the $5$% of red balls that we know exists in the larger population? And not specifically for $500000$, but for any population larger than $50000$.


1 Answer


So you are interested in confidence intervals for a sample average. You probably know the strong law of large numbers for independent, identically distributed (iid) random variables

$$ \newcommand{\E}{\mathbb{E}} $$ $$ \bar{X}_n := \frac1n\sum_{i=1}^n X_i \to \mathbb{E}[X_1] $$

Now if you pick items from a population at random, you sample that population in an iid fashion. If you ask whether item $X_i$ has a certain property, you are asking whether the indicator $1_{X_i \in A}$ of the event $X_i\in A$, where the set $A$ describes the property, equals one. The $Y_i = 1_{X_i \in A}$ are again independent, identically distributed random variables, so the strong law of large numbers applies: $$\begin{aligned} \text{average share in sample} &=\frac{\#\{X_i \in A: i=1,\dots, n\}}{n} \\ &= \frac1n\sum_{i=1}^n 1_{X_i\in A} \\ &\to \mathbb{E}[1_{X_1 \in A}] = \Pr(X_1 \in A) = 0.05 \end{aligned} $$
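This convergence is easy to check empirically. A minimal sketch in Python (the $5$% share is taken from the question; sampling with replacement makes the draws iid, which is a simplifying assumption):

```python
import random

# Hypothetical simulation of the box from the question: 5% of the
# balls are red.  Sampling with replacement makes the indicators
# Y_i iid Ber(p), as in the argument above.
random.seed(0)
p = 0.05  # Pr(X_1 in A), the share of red balls

def sample_share(n):
    """Empirical share of red balls in a random sample of size n."""
    return sum(random.random() < p for _ in range(n)) / n

# The empirical share approaches p = 0.05 as n grows (SLLN).
for n in (100, 10_000, 1_000_000):
    print(n, sample_share(n))
```

For small $n$ the empirical share fluctuates noticeably around $0.05$; for large $n$ it settles close to it, which is exactly the question of how fast, addressed next.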

So the question you are really asking is: how far away will the average $\bar{Y}_n$ be from the expectation $\E[Y_1]$ for a given sample size $n$? Notice that the expectation of the average is the same: $\E[\bar{Y}_n] = \E[Y_1]$. If you want to allow a margin of error of $\epsilon$, we want to bound the probability that we exceed this margin

$$ \Pr(|\bar{Y}_n - \E[\bar{Y}_n]| > \epsilon) $$ These types of inequalities are known as concentration inequalities. The least sophisticated one is perhaps Chebyshev's inequality $$ \Pr(|\bar{Y}_n - \E[\bar{Y}_n]| > \epsilon) \le \frac{\mathbb{V}(\bar{Y}_n)}{\epsilon^2} = \frac{\mathbb{V}(Y_1)}{n\epsilon^2} $$

where $\mathbb{V}(Y_1)$ is the variance of $Y_1$. Since we know that $Y_1$ is Bernoulli distributed, $Y_1 \sim \text{Ber}(p)$ with $p=\Pr(X_1\in A)$, we have $$ \mathbb{V}(Y_1) = p(1-p) = 0.05(1-0.05) = 0.0475 $$ At the same time, we could do much better: since Bernoulli random variables are bounded, we can, for example, apply the Chernoff bound.
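Rearranging Chebyshev's inequality gives a sufficient sample size: if we want $\Pr(|\bar{Y}_n - p| > \epsilon) \le \delta$, it suffices that $n \ge \mathbb{V}(Y_1)/(\epsilon^2\delta)$. A sketch in Python; the choices $\epsilon = 0.01$ and $\delta = 0.05$ are illustrative, not from the question:

```python
from fractions import Fraction
from math import ceil

def chebyshev_sample_size(p, eps, delta):
    """Smallest n for which the Chebyshev bound p(1-p)/(n*eps^2)
    is at most delta.  Exact rational arithmetic avoids a
    floating-point edge case in the ceiling."""
    p, eps, delta = (Fraction(str(x)) for x in (p, eps, delta))
    var = p * (1 - p)                      # V(Y_1) for Ber(p)
    return ceil(var / (eps ** 2 * delta))

# Illustrative choice (not from the question): stay within one
# percentage point of the true 5% share with probability >= 95%.
n = chebyshev_sample_size(0.05, 0.01, 0.05)
print(n)  # -> 9500
```

Note that Chebyshev is loose; sharper bounds or the normal approximation below typically require a considerably smaller $n$ for the same $\epsilon$ and $\delta$.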

If you have a large sample, you could argue that $\bar{Y}_n$ is approximately normally distributed by the central limit theorem and use the normal distribution for your bounds. Since a sum of Bernoulli variables is binomially distributed, you could also use the binomial distribution directly.
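The normal approximation rearranges to $n \ge z_{1-\delta/2}^2\, p(1-p)/\epsilon^2$, where $z_{1-\delta/2}$ is a standard normal quantile. A sketch using Python's `statistics.NormalDist`; the values $\epsilon = 0.01$ and $\delta = 0.05$ are again illustrative, not from the question:

```python
from math import ceil
from statistics import NormalDist

def normal_approx_sample_size(p, eps, delta):
    """Sample size from the CLT approximation
    Pr(|Ybar_n - p| > eps) ~= 2*(1 - Phi(eps*sqrt(n)/sigma)) <= delta,
    which rearranges to n >= z^2 * p*(1-p) / eps^2."""
    z = NormalDist().inv_cdf(1 - delta / 2)  # two-sided normal quantile
    return ceil(z ** 2 * p * (1 - p) / eps ** 2)

# Illustrative choice (not from the question): within one percentage
# point of the true 5% share with probability about 95%.
n = normal_approx_sample_size(0.05, 0.01, 0.05)
print(n)  # -> 1825
```

Unlike Chebyshev, this is an approximation rather than a guaranteed bound, but for sample sizes in the thousands it is usually very accurate for $p = 0.05$.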

In other words: there are many ways you can get your answer.