I'm starting a class in Statistics, and today we had a brief discussion about sample size. The teacher showed us the formula for calculating it when N > 100,000:
n = N / (1 + Ne^2)
We calculated the sample size required for a survey in a country with a population of 200 million, assuming the error (e) to be 2%:
N = 200,000,000
n = 200,000,000 / ( 1 + 200,000,000 * 0.02^2)
n = 2499
But I find this number way too small to be representative of the entire population.
I asked why this was true, but her answer wasn't very convincing. She said it has to do with the normal distribution and the confidence interval (95%), and that a more precise estimate would require a larger sample. But 2% error with 95% confidence already seems quite precise to me, and I still don't see how 2,500 people can be a good sample for 200 million.
What am I missing? Why is this any good? Is there some sort of proof that this is true?
There are a few things that are unclear here. For one thing, $$ \frac{200000000}{1 + 200000000 \times 0.02^2} \approx 2499.96875. $$ So the first mystery is why you would round this down to $2499$ rather than rounding it up to $2500.$
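As a quick arithmetic check, here is that computation in Python (the formula in the question is sometimes called Slovin's formula; note that as $N$ grows, the result approaches $1/e^2$, which is why the answer barely depends on the population size at all):

```python
# Sample-size formula from the question: n = N / (1 + N * e^2)
N = 200_000_000   # population
e = 0.02          # margin of error
n = N / (1 + N * e**2)
print(n)          # ~2499.97, which rounds UP to 2500

# Limiting value for very large N: 1 / e^2
print(1 / e**2)   # ~2500
```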
You also do not specify whether your confidence interval is one-sided or two-sided. Based on the formulas, I'm guessing it's one-sided.
In any case, it's a well-known fact that we humans have practically no natural ability to answer this sort of question (whether $2499$ is a large enough sample for a $2\%$ margin of error in a population of $200$ million). By "no natural ability" I mean that any answer based on a hunch or feeling about how large a sample it ought to take is pretty much worthless unless it is backed up by rigorous statistical analysis. In other words, we are naturally really, really bad at this. Apparently, back when our intelligence was evolving and we needed to gather enough food to eat without being eaten, the ability to design a survey of a population of $200$ million with a $95\%$ one-tailed confidence interval and a $2\%$ margin of error did not confer much of an advantage toward survival.
So here's an example. You poll your sample of $2499$ people and find that $1299$ of them, $51.981\%$ of your sample, say Friday is a better day than Monday. You estimate that at least $51.981\%$ of the total population would say that Friday is better than Monday, with $95\%$ confidence, with a one-sided error of $2\%$ below the estimate.
What would be a violation of your error bounds? Your estimate would be wrong by more than $2\%$ if less than $49.981\%$ of the population prefer Friday. (Since it's a one-sided error bound, we're not worried about numbers that are much higher than $51.981\%.$) So for you to be wrong, fewer than $99962000$ persons in your population must actually prefer Friday.
So if you were wrong, and fewer than $99962000$ prefer Friday, what is the chance that $1299$ or more of your $2499$ people would say they do?
This chance is the probability that when you were randomly selecting persons to poll, you chose at least $1299$ persons from among the $99962000$ who prefer Friday, and no more than $1200$ from among the $100038000$ who do not. It turns out the chance is rather small: not more than $5\%.$ And that's where your $95\%$ confidence comes from: what we mean when we say we have $95\%$ confidence in an answer is that if that answer were wrong, there is a $95\%$ or better chance that we would not have chosen it. (In this case, if the actual percentage in the population were $2\%$ less than $51.981\%,$ there would be at least a $95\%$ chance of fewer than $1299$ Fridays in our sample, in which case we would have estimated a lower percentage.)
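You can check that tail probability numerically. Here is a sketch in Python using a normal approximation to the binomial with continuity correction (my choice for the check, not necessarily how the $5\%$ bound is derived in a textbook; an exact binomial computation gives essentially the same value at this sample size):

```python
import math

def binom_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via a normal
    approximation with continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# If only 49.981% of the population preferred Friday, the chance
# of still seeing 1299 or more Fridays in a sample of 2499:
print(binom_upper_tail(2499, 0.49981, 1299))  # ~0.024, below the 5% bound
```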
Why can't we get such an accurate estimate with a smaller sample? Suppose we choose only $100$ persons for our sample. If the actual percentage of Friday-lovers in the population were actually only $49.5\%,$ there's still a $34\%$ chance that we'd accidentally choose a sample that included $52$ or more Friday-lovers, and then we'd be giving an estimate that was more than $2\%$ above the actual population percentage. A $34\%$ chance to be wrong is usually considered unacceptable in statistics, so we don't choose such a small sample if we want $2\%$ accuracy.
Why do we need more than $2500$ persons in the sample in order to be much more accurate? Because the chance that we would have gotten $1299$ Fridays in our sample would be much greater than $5\%$ if the actual percentage were closer to $51.981\%$; for example, if the actual percentage were $50.9\%,$ there's a $14\%$ chance we'd still get at least $1299$ Friday-lovers in our sample and we'd be giving an estimate that was more than $1\%$ above the true percentage. A $14\%$ chance to be wrong is still too high, so we don't trust our sample of $2499$ to give us $1\%$ accuracy.
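Those two probabilities (the $34\%$ and the $14\%$) can also be checked numerically. Here is a sketch in Python using a normal approximation to the binomial with continuity correction (again my choice for the check, and only approximate, though the exact binomial values are very close):

```python
import math

def binom_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via a normal
    approximation with continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

# Sample of 100, true proportion 49.5%: chance of 52 or more Friday-lovers
print(binom_upper_tail(100, 0.495, 52))     # ~0.34

# Sample of 2499, true proportion 50.9%: chance of 1299 or more Friday-lovers
print(binom_upper_tail(2499, 0.509, 1299))  # ~0.14
```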