You found some rotten fruit in your farm's harvest of 200,000 apples. How many apples do you need to sample to estimate the number of rotten apples to within ±20% with 95% confidence?
All sampled fruit needs to be thrown out, so it's important to find the smallest number of apples required to reach 95% confidence that the estimate is within ±20% of the actual number of rotten apples.
You get to sell all remaining (untested) apples to an applesauce cannery. The cannery has the machinery to break and test each individual apple, so they'll eventually know the exact number of rotten apples.
But your contract demands that if the actual number is more than 20% away from your estimate, the factory doesn't have to pay for the apples. Your boss says he's alright with 95% confidence in your estimate.
How many apples do you test/destroy/sample in order to determine your estimate, while maximizing your chance of being paid for the remaining untested apples?
If data is helpful, here's what happened last year in each of your orchards. You need to tell the factory how many rotten apples you estimate this year from the same orchard. Unfortunately it's only your second year and the previous farmer burned his logs. ;)
orchard_id last_year_rotten_apples last_year_total_apples this_year_total_apples this_year_est_rotten_apples
A 4543 60913 63959 ?
B 6213 57862 60755 ?
C 196606 665926 699222 ?
D 13858 263879 277073 ?
E 12785 141849 148941 ?
F 7441 153486 161160 ?
G 6362 228504 239929 ?
H 3738 57030 59882 ?
I 193528 2394342 2514059 ?
J 73251 867756 911144 ?
K 45118 224893 236138 ?
L 27216 291949 306546 ?
M 63013 768708 807143 ?
N 1240 20982 22031 ?
O 11856 121023 127074 ?
P 5185 30041 31543 ?
Q 5496 26408 27728 ?
R 558 11339 11906 ?
S 8118 145720 153006 ?
T 10568 104308 109523 ?
U 7313 43311 45477 ?
V 14163 169633 178115 ?
W 1897 16339 17156 ?
X ? ? 200000 ?
Each apple is free to test, but the factory will pay you $0.10 per apple (even rotten ones), so long as your estimate is within ±20% (per orchard). If you didn't have to sample any apples, that would be $720,951 this year (from the above 23 orchards). How can you maximize your farm's revenue this year?
Extra credit: You just acquired Orchard X. This year it harvested 200,000 apples. Given what's happened in the other orchards, how many should you sample from this new orchard to reach 95% confidence that your estimate will fall within ±20%?
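As a quick check on the numbers in the question, here is a minimal sketch that reproduces the $720,951 figure and computes last year's rotten fraction per orchard. The values are copied from the table; the names `ORCHARDS`, `PRICE_CENTS_PER_APPLE`, and `last_year_rate` are made up for this illustration.

```python
# (last-year rotten, last-year total, this-year total), copied from the table.
ORCHARDS = {
    "A": (4543, 60913, 63959),       "B": (6213, 57862, 60755),
    "C": (196606, 665926, 699222),   "D": (13858, 263879, 277073),
    "E": (12785, 141849, 148941),    "F": (7441, 153486, 161160),
    "G": (6362, 228504, 239929),     "H": (3738, 57030, 59882),
    "I": (193528, 2394342, 2514059), "J": (73251, 867756, 911144),
    "K": (45118, 224893, 236138),    "L": (27216, 291949, 306546),
    "M": (63013, 768708, 807143),    "N": (1240, 20982, 22031),
    "O": (11856, 121023, 127074),    "P": (5185, 30041, 31543),
    "Q": (5496, 26408, 27728),       "R": (558, 11339, 11906),
    "S": (8118, 145720, 153006),     "T": (10568, 104308, 109523),
    "U": (7313, 43311, 45477),       "V": (14163, 169633, 178115),
    "W": (1897, 16339, 17156),
}

PRICE_CENTS_PER_APPLE = 10  # $0.10 per apple, kept in cents to avoid float rounding

# Revenue if no apple had to be sampled and every orchard's estimate qualified.
total_apples = sum(this_year for _, _, this_year in ORCHARDS.values())
max_revenue_cents = total_apples * PRICE_CENTS_PER_APPLE
print(f"${max_revenue_cents / 100:,.0f}")  # prints $720,951

# Last year's rotten fraction per orchard (only a rough guide for this year).
last_year_rate = {name: rotten / total for name, (rotten, total, _) in ORCHARDS.items()}
for name, rate in last_year_rate.items():
    print(f"{name}: {rate:.3f}")
```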
I am going to ignore some of the higher-level underlying math that is at play here. I'm assuming that this is the type of response you are interested in.
You are trying to calculate confidence intervals for a sample with unknown mean and unknown standard deviation. This means that you will somehow have to derive these from the data. I will assume the true underlying distribution is normal (this is kind of an okay assumption in general, but justifying it requires a lot of math that you would not be interested in seeing).
The standard deviation $\sigma$ is replaced by the "estimated standard deviation" $\bar{\sigma}$ (the sample standard deviation), and the quantity $\frac{\bar{\sigma}}{\sqrt{n}}$ is known as the $\bf{standard \ error}$ of the sample mean. Because $\bar{\sigma}$ is only an estimate of the true standard deviation, $\bf{the \ distribution \ of \ the \ sample \ mean \ \bar{x}}$ can no longer be treated as normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$. Instead, the standardized sample mean $\frac{\bar{x}-\mu}{\bar{\sigma}/\sqrt{n}}$ follows the t-distribution. The t-distribution is described by its degrees of freedom. For a sample of size $n$, the t-distribution will have $n-1$ degrees of freedom. The notation for a t-distribution with $k$ degrees of freedom is $t(k)$. As the sample size $n$ increases, the t-distribution becomes closer to the standard normal distribution, since the sample standard deviation approaches the true standard deviation for large $n$.
Now, for a population with unknown mean $\mu$ and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample of size $n$, is
$\bar{x} \pm t^{*}\cdot \frac{\bar{\sigma}}{\sqrt{n}}$
where $C$ is the confidence level (here $C = 0.95$) and $t^{*}$ is the upper $(1-C)/2$ critical value for the t-distribution with $n-1$ degrees of freedom, $t(n-1)$.
In your case, you need to find $n$ such that $t^{*}\cdot \frac{\bar{\sigma}}{\sqrt{n}}$ is less than or equal to 0.20, reading the ±20% as a margin on the proportion of rotten apples.
Since we do not know the value of $\bar{\sigma}$ before sampling, this may seem impossible. However, we can easily bound $\bar{\sigma}$ by 1, which reduces the problem to finding the smallest $n$ with $t^{*}/{\sqrt{n}}\leq 0.20$. The right value of $n$ for this turns out to be 99. That is, after upper-bounding the standard deviation by 1, the minimum number of apples you need to sample to estimate the proportion of rotten apples (assuming a rotten apple takes value 1 and a non-rotten apple takes value 0 when estimating the mean and s.d.) is 99. Hope this approach helps.
EDIT: The standard deviation of a 0/1 variable is actually bounded above by 1/2 (since $\sqrt{p(1-p)} \leq 1/2$ for any proportion $p$), so sampling 99 apples actually puts you within $\pm 10\%$, that is, within about 0.10 of the true proportion.
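The sample-size search described above can be sketched in plain Python. This is only an illustration under the answer's assumptions (t-based interval, standard deviation replaced by an upper bound); the helper names `t_pdf`, `t_cdf`, `t_critical`, `margin`, and `min_sample_size` are invented here, and the critical value is found by numerical integration and bisection rather than by a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=200):
    """CDF for x >= 0: 0.5 plus Simpson's-rule integral of the density on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Upper (1 - confidence)/2 critical value t*, found by bisection on [0, 100]."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def margin(n, sigma, confidence=0.95):
    """Half-width t* * sigma / sqrt(n) of the confidence interval."""
    return t_critical(confidence, n - 1) * sigma / math.sqrt(n)

def min_sample_size(sigma, max_margin, confidence=0.95):
    """Smallest n (at least 2) whose margin does not exceed max_margin."""
    n = 2
    while margin(n, sigma, confidence) > max_margin:
        n += 1
    return n

print(min_sample_size(1.0, 0.20))   # prints 99
print(round(margin(99, 0.5), 4))    # margin at n = 99 with the tighter bound of 1/2
```

With the bound of 1 the search stops at 99, and plugging in the bound of 1/2 for the same 99 apples gives a margin just under 0.10, which matches the EDIT.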
In order to estimate the proportion of rotten apples over your entire farm (assuming one giant orchard), you can use roulette wheel selection to first decide which sub-orchard [A,B,...,W] to choose a random apple from, and then decide which apple to choose (choosing the random apple within the sub-orchard might be tricky, but you could always take a slightly larger sample if need be). I would recommend randomly choosing a number between 1 and the number of trees in the sub-orchard and then picking an apple from that tree (how you order the trees is not important, and you don't actually need to count, just be close). Obviously this isn't going to be completely 100% random, but try to make it "as random as possible." Repeat this process 99 times and calculate the proportion of rotten apples in the sample. This will be within 10% of the true proportion with 95% confidence.
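The orchard-selection step can be sketched as follows. This is a rough illustration, assuming that "roulette wheel" means picking each orchard with probability proportional to its harvest size this year; `ORCHARD_SIZES`, `allocate_samples`, and `estimate_proportion` are names invented for the example, and picking the actual apple within an orchard is left to the procedure described above.

```python
import random
from collections import Counter

# This year's harvest sizes, copied from the table in the question.
ORCHARD_SIZES = {
    "A": 63959, "B": 60755, "C": 699222, "D": 277073, "E": 148941,
    "F": 161160, "G": 239929, "H": 59882, "I": 2514059, "J": 911144,
    "K": 236138, "L": 306546, "M": 807143, "N": 22031, "O": 127074,
    "P": 31543, "Q": 27728, "R": 11906, "S": 153006, "T": 109523,
    "U": 45477, "V": 178115, "W": 17156,
}

def allocate_samples(sizes, n_samples, seed=None):
    """Roulette wheel selection: each draw picks an orchard with probability
    proportional to its size. Returns how many apples to take per orchard."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    picks = rng.choices(names, weights=weights, k=n_samples)
    return Counter(picks)

def estimate_proportion(outcomes):
    """Sample proportion of rotten apples, with rotten coded as 1 and sound as 0."""
    return sum(outcomes) / len(outcomes)

allocation = allocate_samples(ORCHARD_SIZES, 99, seed=1)
for name, count in sorted(allocation.items()):
    print(name, count)
```

After cutting open the sampled apples, `estimate_proportion` would be called on the list of 0/1 results, for example `estimate_proportion([1, 0, 0, 0])` gives 0.25.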