Sampling Distributions: Sample Size of 1 vs Sample Size of m


I saw this example on a website:

Suppose there is a jar containing many gumballs, each with a number on it. The numbers range from 0 to 32, and there is an equal number of gumballs with each number. A student runs an experiment with the following procedure: pick five gumballs from the jar, calculate the mean of the numbers on the gumballs, write down the result on a piece of paper, and put the gumballs back in the jar. Repeat the process 499 times, so that altogether 500 means are recorded.

How does this compare with an approach that uses a sample size of 1 and no replacement, so that he effectively picks 2500 gumballs at once? Isn't that a better estimate of the mean?

More generally, is taking N/m samples of size m better than taking N samples of size 1 when estimating a population mean? In which case will the variance be higher?


Best answer

Let's first clearly define some terminology here.

  • A draw consists of the process of taking a single gumball from the jar and observing its value.
  • A sample consists of the process of taking five draws from the jar and observing each of their values.
  • A draw with replacement consists of a draw for which the gumball is returned to the jar after observing its value.
  • A draw without replacement consists of a draw for which the gumball is NOT returned to the jar.
  • A sample with replacement consists of a sample for which the sample of five balls is returned to the jar.
  • A sample without replacement consists of a sample for which no balls are returned to the jar.
  • A sample mean is the arithmetic mean of the values observed from a single sample.

As the scenario is described, the draws are without replacement but the samples are with replacement. However, because we are told that there are "many" balls, and the proportion of balls that are labeled with a given number (from 0 to 32) is equal, we can assume that the sampling distribution of individual draws without replacement is approximately the same as the sampling distribution of draws with replacement; that is to say, because there are many gumballs, individual draws are assumed to be independent and identically distributed.

Now, under the assumption that samples are taken with replacement (as described in the given scenario), each sample is also independent and identically distributed. So the sampling distribution of the sample mean takes on a simple form that does not depend on the number of gumballs in the jar, only the number of samples taken (500) and the number of draws in each sample (5).
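To put numbers on this, here is a minimal Python sketch under the i.i.d. assumption above, with labels 0 to 32 equally likely. It relies on the standard result that the mean of n i.i.d. draws has variance σ²/n. The names `var_of_mean`, `var_grand_mean`, and `var_pooled` are illustrative choices, not part of the original example.

```python
from fractions import Fraction

# Labels 0..32, each assumed equally likely on any single draw.
values = range(33)
mu = Fraction(sum(values), len(values))  # population mean: 16
sigma2 = sum((Fraction(v) - mu) ** 2 for v in values) / len(values)  # 272/3


def var_of_mean(variance, n):
    """Variance of the mean of n i.i.d. quantities that each have this variance."""
    return variance / n


# One sample mean is the mean of 5 draws.
var_one_sample_mean = var_of_mean(sigma2, 5)

# Averaging 500 independent sample means.
var_grand_mean = var_of_mean(var_one_sample_mean, 500)

# Treating the same 2500 draws as one pooled i.i.d. sample.
var_pooled = var_of_mean(sigma2, 2500)

# Both are 68/1875 (about 0.0363).
print(var_grand_mean, var_pooled)
```

Under independence, grouping the 2500 draws into 500 samples of 5 does not change the variance of the overall mean; only the total number of draws matters.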

If we instead take samples without replacement, as you propose, then the samples are no longer independent and identically distributed, because the removal of previous samples means that subsequent samples are not taken from the same population. This makes the computation of the sampling distribution dependent on the total number of balls in the jar and on how many balls carry each label.

Now, your question is whether sampling with replacement or without replacement gives a better estimate of the true population mean of the values on all the balls. The answer depends on how many balls are in the jar and how many samples you take. If you take as many samples without replacement as the jar allows (e.g., the jar has 500 balls and you take 100 samples of 5 draws each), the resulting overall mean equals the population mean exactly, because you have observed every ball. But now suppose there are infinitely many balls in the jar. Then it makes no difference: both approaches estimate the true population mean equally well, so long as you take the same number of samples in either scenario.

As you might expect, a single sample consisting of 2500 draws without replacement, as opposed to 500 samples with replacement of 5 draws without replacement, will be better at estimating the true population mean if the number of balls in the jar is at least 2500, but finite. But the sampling distribution is much more complicated to express in the non-replacement scenario.
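For the variance of the mean alone (as opposed to the full sampling distribution), the non-replacement case has a simple closed form: the with-replacement variance σ²/n is multiplied by the finite population correction (N − n)/(N − 1), where N is the number of balls in the jar. The sketch below applies it to 2500 draws. The jar sizes are hypothetical, chosen as multiples of 33 so that every label can be equally common, and the function name is illustrative.

```python
from fractions import Fraction

# Variance of the labels 0..32 when every label is equally common in the jar.
SIGMA2 = Fraction(272, 3)


def var_mean_without_replacement(sigma2, population_size, n_draws):
    """Variance of the mean of n_draws taken without replacement from a
    finite population, using the finite population correction."""
    fpc = Fraction(population_size - n_draws, population_size - 1)
    return sigma2 / n_draws * fpc


# Reference value: 2500 independent draws (equivalently, with replacement).
var_with_replacement = SIGMA2 / 2500

# Hypothetical jar sizes, each a multiple of 33.
for jar_size in (2508, 3300, 33_000, 3_300_000):
    v = var_mean_without_replacement(SIGMA2, jar_size, 2500)
    # Print the variance and its ratio to the with-replacement variance.
    print(jar_size, float(v), float(v / var_with_replacement))
```

The ratio is well below 1 when the jar holds barely more than 2500 balls and approaches 1 as the jar grows, which matches the claim that the two schemes become indistinguishable for a very large jar.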

Another answer

As you suspect, the variance will be higher when you allow replacement of the balls. Replacement makes samples like $\{0,0,0,0,0\}$ more likely than they would be without replacement, since you might draw the same balls again. Samples like this have averages that are more extreme.
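A small exact check of this, as a sketch on a deliberately tiny toy jar (one ball for each label 0 to 4, samples of two draws, which is an assumption for illustration and not the jar from the question): enumerate every equally likely ordered sample under each scheme and compute the variance of the sample mean.

```python
from fractions import Fraction
from itertools import permutations, product

# Toy jar: one ball per label (hypothetical, far smaller than the real jar).
jar = [0, 1, 2, 3, 4]
sample_size = 2


def variance_of_sample_mean(samples):
    """Variance of the sample mean over a list of equally likely samples."""
    means = [Fraction(sum(s), len(s)) for s in samples]
    center = sum(means) / len(means)
    return sum((m - center) ** 2 for m in means) / len(means)


# With replacement: any ordered pair of labels, including repeats like (0, 0).
with_replacement = list(product(jar, repeat=sample_size))

# Without replacement: ordered pairs of distinct balls.
without_replacement = list(permutations(jar, sample_size))

var_with = variance_of_sample_mean(with_replacement)
var_without = variance_of_sample_mean(without_replacement)

print(var_with, var_without)  # prints: 1 3/4
```

The repeated pairs such as (0, 0) and (4, 4) appear only in the with-replacement list, and they are exactly the samples with the most extreme means.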