How can I calculate the number of samples required to reach a given confidence level?

You found some rotten fruit in your farm's harvest of 200,000 apples. How many apples do you need to sample to estimate the number of rotten apples ±20%, 95% confidence?

All sampled fruit needs to be thrown out, so it's important to find the smallest number of apples required to reach 95% confidence that the estimate is within ±20% of the actual number of rotten apples.


You get to sell all remaining (untested) apples to an applesauce cannery. The cannery has the machinery to break and test each individual apple, so they'll eventually know the exact number of rotten apples.

But your contract demands that if the actual count is more than 20% away from your estimate, the factory doesn't have to pay for the apples. Your boss says he's alright with 95% confidence in our estimate.

How many apples do you test/destroy/sample in order to determine your estimate, maximizing your chance to be paid for the remaining untested apples?


If data is helpful, here's what happened last year in each of your orchards. You need to tell the factory how many rotten apples you estimate this year from the same orchard. Unfortunately it's only your second year and the previous farmer burned his logs. ;)

            Last Year       Last Year           This Year       This Year
orchard_id  rotten_apples   total_apples        total_apples    est_rotten_apples
A           4543            60913               63959           ?
B           6213            57862               60755           ?
C           196606          665926              699222          ?
D           13858           263879              277073          ?
E           12785           141849              148941          ?
F           7441            153486              161160          ?
G           6362            228504              239929          ?
H           3738            57030               59882           ?
I           193528          2394342             2514059         ?
J           73251           867756              911144          ?
K           45118           224893              236138          ?
L           27216           291949              306546          ?
M           63013           768708              807143          ?
N           1240            20982               22031           ?
O           11856           121023              127074          ?
P           5185            30041               31543           ?
Q           5496            26408               27728           ?
R           558             11339               11906           ?
S           8118            145720              153006          ?
T           10568           104308              109523          ?
U           7313            43311               45477           ?
V           14163           169633              178115          ?
W           1897            16339               17156           ?
X           ?               ?                   200000          ?

Each apple is free to test, but the factory will pay you $0.10 per apple (even rotten ones), so long as your estimate is within ±20% (per orchard). If you didn't have to sample any apples, that's $720,951 this year (from the above 23 orchards). How can you maximize your farm's revenue this year?


Extra credit: You just acquired Orchard X. This year it harvested 200,000 apples. Given what's happened in the other orchards, how many should you sample from this new orchard to reach 95% confidence that your estimate will fall within ±20%?

There are 1 best solutions below

I am going to ignore some of the higher-level underlying math that is at play here. I'm assuming that this is the type of response you are interested in.

You are trying to calculate a confidence interval for a population with unknown mean and unknown standard deviation. This means that you will somehow have to derive these from the data. I will assume the true underlying distribution is normal (this is kind of an okay assumption in general, but justifying it requires a lot of math that you would not be interested in seeing).

Because $\sigma$ is unknown, it is replaced by the sample standard deviation $\bar{\sigma}$, and $\frac{\bar{\sigma}}{\sqrt{n}}$ is known as the $\bf{standard \ error}$ of the mean. Because $\bar{\sigma}$ is only an estimate of the true standard deviation, $\bf{the \ distribution \ of \ the \ sample \ mean \ \bar{x}}$ is no longer treated as normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$. Instead, the standardized sample mean $\frac{\bar{x}-\mu}{\bar{\sigma}/\sqrt{n}}$ follows the t-distribution. The t-distribution is described by its degrees of freedom: for a sample of size $n$, it has $n-1$ degrees of freedom. The notation for a t-distribution with $k$ degrees of freedom is $t(k)$. As the sample size $n$ increases, the t-distribution becomes closer to the normal distribution, since the sample standard deviation approaches the true standard deviation for large $n$.

Now, for a population with unknown mean $\mu$ and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample of size $n$ is

$\bar{x} \pm t^{*}\cdot \frac{\bar{\sigma}}{\sqrt{n}}$

where $C$ is the confidence level (here $C = 0.95$) and $t^{*}$ is the upper $(1-C)/2$ critical value for the t-distribution with $n-1$ degrees of freedom, $t(n-1)$.
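To make the formula concrete, here is a minimal Python sketch that evaluates the interval for a made-up sample. The sample (99 apples, 8 of them rotten), the function name `t_confidence_interval`, and the constant `T_STAR_98_DF` are all assumptions for illustration; the critical value 1.984 is the usual table value for 98 degrees of freedom at 95% confidence.

```python
import math
import statistics


def t_confidence_interval(sample, t_star):
    """Return (low, high) for the mean: x_bar +/- t_star * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 divisor, i.e. the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    half_width = t_star * std_err
    return mean - half_width, mean + half_width


# Hypothetical data: 1 = rotten, 0 = good (not taken from the question).
sample = [1] * 8 + [0] * 91

# Assumed table value of t* for 98 degrees of freedom, 95% confidence.
T_STAR_98_DF = 1.984

low, high = t_confidence_interval(sample, T_STAR_98_DF)
print(f"estimated rotten proportion: {statistics.mean(sample):.3f}")
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

Note that the half-width here depends on the observed sample standard deviation, which is why it cannot be known exactly before sampling.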

In your case, you need to find $n$ such that $t^{*}\cdot \frac{\bar{\sigma}}{\sqrt{n}}$ is less than or equal to 0.20.

Since we do not know the value of $\bar{\sigma}$ before sampling, it may seem that this is impossible. However, if a rotten apple takes value 1 and a non-rotten apple takes value 0 when estimating the mean and standard deviation, we can easily bound $\bar{\sigma}$ by 1, which reduces the problem to finding the smallest $n$ with $t^{*}/{\sqrt{n}}\leq 0.20$. The right value of $n$ for this turns out to be 99 (at $n = 98$ the ratio is still slightly above 0.20). That is, after upper bounding the standard deviation by 1, the minimum number of apples you need to sample to estimate the proportion of rotten apples is 99. Hope this approach helps.

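EDIT: The standard deviation is actually bounded above by 1/2 (a 0/1 variable with mean $p$ has variance $p(1-p) \leq 1/4$), so sampling 99 apples actually puts you within about $\pm 10\%$, measured on the 0-to-1 proportion scale.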
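The search for $n$ described above can be sketched in Python using only the standard library. Since the standard library has no t-distribution, this sketch integrates the t density numerically (Simpson's rule) and finds the critical value by bisection; the helper names `t_cdf`, `t_critical`, and `min_sample_size` are my own, and with SciPy installed `scipy.stats.t.ppf` would do the same job more directly.

```python
import math


def t_cdf(x, df, steps=200):
    """CDF of Student's t at x >= 0, via composite Simpson's rule on [0, x]."""
    if x == 0:
        return 0.5
    # Log of the normalizing constant of the t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_critical(confidence, df):
    """Upper (1 - C)/2 critical value of t(df), found by bisection."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # grow the bracket until it contains t*
        hi *= 2
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def min_sample_size(margin, sigma_bound, confidence=0.95):
    """Smallest n with t* * sigma_bound / sqrt(n) <= margin."""
    n = 2
    while t_critical(confidence, n - 1) * sigma_bound / math.sqrt(n) > margin:
        n += 1
    return n


print(min_sample_size(margin=0.20, sigma_bound=1.0))  # 99
print(min_sample_size(margin=0.10, sigma_bound=0.5))  # 99
```

Both calls reduce to the same inequality $t^{*}/\sqrt{n} \leq 0.20$, which is why the tighter bound of 1/2 turns the same 99 apples into a margin of about 0.10.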

In order to estimate the proportion of rotten apples over your entire farm (assuming one giant orchard), you can use roulette-wheel selection, with each sub-orchard weighted by its number of apples, to first decide the sub-orchard [A,B,...,W] to choose a random apple from, and then decide which apple to choose (choosing the random apple within the sub-orchard might be tricky, but you could always take a slightly larger sample if need be). I would recommend randomly choosing a number between 1 and the number of trees in the sub-orchard and then picking an apple from that tree (how you order the trees is not important, and you don't actually need to count exactly, just be close). Obviously this isn't going to be completely 100% random, but try to make it "as random as possible." Repeat this process 99 times and calculate the fraction of sampled apples that are rotten. This will be within 10% of the true average with 95% confidence.
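The roulette-wheel step above can be sketched with Python's `random.choices`, which draws items with probability proportional to the given weights. The orchard totals are this year's counts copied from the table in the question; the function name `allocate_samples` and the fixed seed are assumptions for illustration, and picking the actual apple within each orchard is still up to you.

```python
import random
from collections import Counter

# This year's total_apples per orchard, copied from the question's table.
TOTAL_APPLES = {
    "A": 63959, "B": 60755, "C": 699222, "D": 277073, "E": 148941,
    "F": 161160, "G": 239929, "H": 59882, "I": 2514059, "J": 911144,
    "K": 236138, "L": 306546, "M": 807143, "N": 22031, "O": 127074,
    "P": 31543, "Q": 27728, "R": 11906, "S": 153006, "T": 109523,
    "U": 45477, "V": 178115, "W": 17156,
}


def allocate_samples(totals, n_samples, rng):
    """Pick an orchard for each sampled apple, weighted by orchard size."""
    orchards = list(totals)
    weights = [totals[name] for name in orchards]
    picks = rng.choices(orchards, weights=weights, k=n_samples)
    return Counter(picks)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
allocation = allocate_samples(TOTAL_APPLES, 99, rng)
for name, count in sorted(allocation.items()):
    print(f"orchard {name}: sample {count} apple(s)")
```

Large orchards such as I will receive most of the 99 draws, and small ones may receive none, which is expected when the goal is a farm-wide proportion rather than a per-orchard estimate.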