What is the correct standard deviation when splitting a sample?


I roll a four-faced die 1000 times, but I have 100 dice, so I separate the rolls into 10 batches of 100 each and tally the results. I want to calculate the standard deviation of the 0 count. As an example, here's a result:

{0: 251, 1: 254, 2: 271, 3: 224}, $\mu = \frac{251}{1000} = 0.251$

{0: 30, 1: 24, 2: 26, 3: 20}
{0: 25, 1: 25, 2: 26, 3: 24}
{0: 22, 1: 22, 2: 27, 3: 29}
{0: 23, 1: 26, 2: 30, 3: 21}
{0: 24, 1: 20, 2: 30, 3: 26}
{0: 26, 1: 31, 2: 26, 3: 17}
{0: 22, 1: 23, 2: 32, 3: 23}
{0: 23, 1: 32, 2: 23, 3: 22}
{0: 27, 1: 28, 2: 22, 3: 23}
{0: 29, 1: 23, 2: 29, 3: 19}

[Histogram: distribution of the total tally]

The first way I do it is by using the normal approximation: $$\sigma_1 = \sqrt{\frac{0.251(1-0.251)}{1000}} = 0.0137.$$

The second way is to calculate the standard deviation of the 10 batch proportions, which gives: $$\sigma_2 = \sqrt{\frac{(0.3-0.251)^2+(0.25-0.251)^2+\cdots+(0.29-0.251)^2}{10}}=0.027$$

I tried changing and increasing both the total size and the batch size, but the results never approach each other. I think they are both consequences of the central limit theorem, and the discrepancy is due to the sampling technique. Which is more correct, or are they both wrong? What's the right way to find $\sigma$ of the count of 0 (or 1, etc.)? Thank you!

Here's the Python code I used to generate the problem:

import collections

import matplotlib.pyplot as plt
import numpy as np

small = 100  # rolls per batch
big = 1000   # total rolls

die = np.random.randint(0, 4, big)
diedict = collections.Counter(die)
print(dict(sorted(diedict.items())))  # the total tally

p = diedict[0] / big
std1 = np.sqrt(p * (1 - p) / big)

sumsquare = 0
for i in range(0, big, small):
    subtally = collections.Counter(die[i:i + small])
    print(dict(sorted(subtally.items())))  # the separate batches
    sumsquare += (subtally[0] / small - p) ** 2

std2 = np.sqrt(sumsquare / (big / small))
print(std1, std2)

plt.bar(*zip(*sorted(diedict.items())))  # histogram of the total tally
plt.show()

3 Answers

Accepted answer:

There are multiple ways to interpret what's going on here.

We could assume the dice are all fair four-sided dice and that what you have done is an exercise in sampling from a population consisting of all possible rolls of a fair four-sided die. In that case you have $10$ samples of $100$ rolls per sample, which you can combine into a single sample of $1000$ rolls.

Of course what you have done in python is merely a simulation of the rolls of fair four-sided dice, but let's accept it as a reasonable proxy for the ideal mathematical process. (For what it's worth, even if you used real dice you would only be approximating the rolls of fair four-sided dice, because we cannot be sure that all the dice are precisely fair given their construction and the way you roll them.)

On the other hand, we could say that what you have done is to use your simulated dice to generate a population of $1000$ individuals, each of which has a numeric value. Exactly $251$ individuals in the population have the numeric value $0,$ which means that if you selected an individual from this population at random and asked if its value is $0,$ the answer ($1$ for true, $0$ for false) is a Bernoulli variable with mean exactly $0.251.$

What exactly then is "the 0 count"?

If the 0 count means the number of zeros in the observation of one roll, where the observation is chosen at random from your $1000$ total observations, then the 0 count has mean $\mu = 0.251$, just as you stated.

The standard deviation of the 0 count for an observation chosen at random from this population is $\sqrt{0.251(1-0.251)} \approx 0.43359.$
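As a quick sanity check (a sketch, using only the counts quoted in the question), this value follows directly from the Bernoulli formula:

```python
import math

p = 251 / 1000               # fraction of 0s among the 1000 rolls
sd = math.sqrt(p * (1 - p))  # SD of a Bernoulli(p) indicator
print(round(sd, 5))          # 0.43359
```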


For the following, let's take the interpretation that your data are merely a sample of $1000$ observations from the population of all possible rolls of fair four-sided dice. Then $0.251$ is only the mean number of 0s per die observed in your sample; that is, it is the sample mean. This is an estimate of the population mean, but not necessarily exactly equal to it.

In this interpretation, you have $251$ observations where the 0 count is $1$ and $749$ where it is $0.$ The sample standard deviation is $s = \sqrt{0.251(1-0.251)} \approx 0.43359$ (the same as when we regarded the $1000$ rolls as the entire population), but the usual estimate for the standard deviation of the population is slightly larger, $$ \hat\sigma = \sqrt{\frac{251(1 - 0.251)^2 + 749(0 - 0.251)^2}{999}} \approx 0.43381. $$
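The two estimates differ only in the divisor ($n$ versus $n-1$). A small sketch using the counts from the question:

```python
import math

n, k = 1000, 251  # total rolls and observed 0-count
p_hat = k / n
ss = k * (1 - p_hat) ** 2 + (n - k) * (0 - p_hat) ** 2  # sum of squared deviations

s = math.sqrt(ss / n)                # sample SD (divide by n)
sigma_hat = math.sqrt(ss / (n - 1))  # Bessel-corrected estimate (divide by n-1)
print(round(s, 5), round(sigma_hat, 5))
```

The correction matters little at $n = 1000$, which is why the two values agree to three decimal places.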

We might also be interested in the standard error of the mean. That's a measure of how much your sample mean ($0.251$ for this sample) is likely to have varied from the population mean (which is $0.25$). (It's actually the standard deviation of the population of all possible random samples of the same size from the underlying population.) We can estimate the standard error of the mean from the sample standard deviation: $$ \mathop{SEM} = \frac{s}{\sqrt{N}} \approx \frac{0.43359}{\sqrt{1000}} \approx 0.013711. $$ That agrees with what you found in your "normal approximation."
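Numerically (a sketch with the sample values above):

```python
import math

s = math.sqrt(0.251 * (1 - 0.251))  # sample SD of the 0-indicator
sem = s / math.sqrt(1000)           # standard error of the mean
print(round(sem, 6))                # 0.013711
```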


Your second way also appears to be related to the standard error of the mean. Continuing with the interpretation that your data are merely a sample of $1000$ observations from the population of all possible rolls of fair four-sided dice, you have ten samples of $100$ rolls each, each of which has a mean that may vary from the population mean (which is $0.25$ in this interpretation). In this case the standard error of the mean is obtained for each sample by dividing the sample standard deviation by $\sqrt{100},$ resulting in standard errors that range from about $0.0414$ to $0.0458.$ The sample that happens to exactly match a population of fair four-sided dice, where the 0 occurs $25$ times, has standard error $0.0433.$
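These per-sample standard errors can be reproduced from the ten 0-counts in the question (a sketch; the counts are read off the tallies above):

```python
import math

zero_counts = [30, 25, 22, 23, 24, 26, 22, 23, 27, 29]  # 0-counts of the ten batches
ses = [math.sqrt((c / 100) * (1 - c / 100) / 100) for c in zero_counts]
print(round(min(ses), 4), round(max(ses), 4))  # 0.0414 0.0458
```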

As it happens, you have more than the expected number of sample means within a range of $\pm$ two standard errors, whether you count from the (theoretical) population mean or the mean of the sample of $1000$ rolls. Maybe this is due to a defect in the random number generator, but it could just be luck. Either way, you have a smaller amount of deviation than normal, so when you take your sample of $10$ observations of taking samples of $100$ rolls, and take the sample standard deviation of those $10$ observations, you get a result less than the standard error of any of the individual samples of $100.$

So if you consider your "second way" as a way of estimating the standard error of a sample of $100$ by taking ten samples of $100$ and taking the sample standard deviation of those ten observations, you arrive at an underestimate of the standard error of the mean for $100$ rolls.

To be clear: the result you get from your "second method" is (somewhat) surprisingly small. The fact that it is larger than the standard error of a sample of $1000$ is a good thing, because the standard error of the mean of $100$ rolls should be larger than the standard error of $1000$ rolls. The only discrepancy is that there should be an even larger difference between the two results.
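A quick simulation supports this (a sketch under the fair-die assumption; the seed and repetition count are arbitrary): repeating the "ten batches of 100" experiment many times, the population-style SD of the ten proportions typically lands near, though slightly below, the theoretical $0.0433$, and well above the $0.027$ observed here.

```python
import numpy as np

rng = np.random.default_rng(0)
sds = []
for _ in range(2000):
    rolls = rng.integers(0, 4, size=(10, 100))  # ten batches of 100 fair rolls
    props = (rolls == 0).mean(axis=1)           # ten 0-proportions
    sds.append(props.std())                     # SD over the ten batches (ddof=0)
print(round(float(np.mean(sds)), 3))            # close to (slightly below) 0.0433
```

The small downward bias comes from using `ddof=0` with only ten observations; the $0.027$ in the question is unusually low even allowing for that.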


If we do not assume the dice are fair, things get a little more complicated. If the dice are not all fair, are they all unfair in the exact same way, or can they be unfair in different ways? In the first case we can take $0.251$ as the best estimate of the mean 0 count for each die; in the second case $0.251$ is only the estimated mean of the means, where each die might have a different mean 0 count. The second case violates the usual assumptions behind a lot of the formulas we have used here.

Second answer:

The number of occurrences of a particular face has a binomial distribution with parameters $n$ and $\frac14$,

so the number of occurrences has mean $\frac n4$, variance $\frac{3}{16}n$, and standard deviation $\sqrt{\frac{3}{16}n}$,

and the proportion of occurrences has mean $\frac14$, variance $\frac{3}{16n}$, and standard deviation $\sqrt{\frac{3}{16n}}$.

When $n=1000$ this last standard deviation is $\sqrt{0.0001875} \approx 0.0137$, close to what you found. If you want the standard deviation of the proportion of a particular face over $1000$ attempts, this is the better approach.

When $n=100$ this last standard deviation is $\sqrt{0.001875} \approx 0.0433$, and if you repeated your simulations you should get values around this. Your particular example was low, though not exceptionally low (and you did not make any adjustment for using the sample mean to calculate the sample standard deviation).
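Both theoretical values can be computed from one small helper (a sketch; the function name is my own):

```python
import math

def prop_sd(n, p=0.25):
    """SD of the proportion of a given face in n rolls of a fair four-sided die."""
    return math.sqrt(p * (1 - p) / n)

print(round(prop_sd(1000), 4))  # 0.0137
print(round(prop_sd(100), 4))   # 0.0433
```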

Third answer:

My own take on this question.

Here's the difference in statistical meaning:

  1. $\sigma_1$ is the standard deviation of the ratio of 0 when it is rolled 1000 times.
  2. $\sigma_2$ is the standard deviation of the ratio of 0 when it is rolled 100 times.

Here are the differences in sampling technique, as demonstrated in the question:

  1. There are 2 ways to calculate $\sigma_1$.

Firstly, $\sigma_{1 \text{theory}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{3}{16\cdot 1000}}$ is calculated under the assumption of a fair die. It represents the theoretical prediction for 1000 rolls of a fair die.

Secondly, $\sigma_{1 \text{single}}$ is calculated from the binomial formula, assuming each roll within a trial is either 0 or not 0 and obeys the same bias. From the sample we know the parameters of that formula: $p = 0.251$ and $n = 1000$. Then $\sigma_{1 \text{single}}= \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.251(1-0.251)}{1000}}$. It represents the experimental result of that single trial of 1000 rolls only, ignoring the others. The die is not necessarily fair.

  2. There are 3 ways to calculate $\sigma_2$.

Firstly, $\sigma_{2 \text{theory}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{3}{16 \cdot 100}}$ is calculated under the assumption of a fair die. It represents the theoretical prediction for 100 rolls of a fair die.

Secondly, it is possible to assume the same binomial distribution for the rolls within each trial. Therefore, $\sigma_{2 \text{single}} = \sqrt{\frac{p(1-p)}{n}}$ represents the experimental result of each single trial of 100 rolls. There will be 10 of them: $\sigma_{2 \text{single 1}}, \cdots, \sigma_{2 \text{single 10}}$.

Thirdly, $\sigma_{2 \text{sampling}}$ can be estimated directly from the traditional standard deviation formula, with no distributional assumption at all. All we have are the ratios, which could have come from any distribution. It becomes unreasonable to think any single $\sigma_{2 \text{single}}$ is representative of the 10 trials. Therefore, this method is representative of the experimental results of 10 trials of 100 rolls, where the 10 trials could differ from each other.
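Computed directly from the ten ratios in the question (a sketch; the list is read off the tallies above):

```python
import math

props = [0.30, 0.25, 0.22, 0.23, 0.24, 0.26, 0.22, 0.23, 0.27, 0.29]  # ten 0-ratios
mean = sum(props) / len(props)                                        # 0.251
sigma2_sampling = math.sqrt(sum((x - mean) ** 2 for x in props) / len(props))
print(round(sigma2_sampling, 3))  # 0.027
```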

  1. The third method is the go-to method if we were doing repeated experiments of multiple rolls; we can only assume the rolls within each trial are similar, since they happen under the same conditions, but the distinct trials themselves could differ (different days, more wind, etc.).

  2. Moreover, we will find that as we do more trials, if the trials themselves share the same bias, then $\sigma_{2 \text{sampling}} \approx \sigma_{2 \text{theory}}$. In a real experiment, though, that is not necessarily true.
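A simulation sketch of that convergence (fair-die assumption; the seed and trial count are arbitrary): with many 100-roll trials sharing the same bias, the sampling SD approaches $\sigma_{2 \text{theory}} = \sqrt{3/1600} \approx 0.0433$.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 10000                                                    # many trials of 100 rolls
props = (rng.integers(0, 4, size=(trials, 100)) == 0).mean(axis=1)  # 0-proportion per trial
print(round(float(props.std()), 3))                               # approaches 0.0433
```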