How does using one distribution as another's sample size affect variance?

For example, suppose I roll a 6-sided die and record the number shown. Then I roll that many more 6-sided dice and record the sum of those extra dice. What is the variance of this sum, and how does the result carry over to continuous distributions such as the normal distribution?

2
On BEST ANSWER

Let $N$ take values in the natural numbers and let the $X_i$ be iid, distributed as the other variable and independent of $N$; you want the variance of $Y=\sum_{i=1}^N X_i$. The conditional variance formula gives $$\begin{aligned}Var(Y)&=Var(E(Y|N))+E(Var(Y|N))\\ &=Var(N\,E(X_1))+E(N\,Var(X_1))\\ &=(EX_1)^2\,Var(N)+Var(X_1)\,EN.\end{aligned}$$
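
As a sketch of how the same formula covers a continuous summand (the Poisson count below is an illustrative assumption, not part of the question): if the $X_i$ are iid $\mathcal N(\mu,\sigma^2)$ and $N\sim\text{Poisson}(\lambda)$ is independent of them, then $EN=Var(N)=\lambda$ and $$Var(Y)=\mu^2\,Var(N)+\sigma^2\,EN=\lambda(\mu^2+\sigma^2).$$ Only the mean and variance of $X_1$ enter, so it makes no difference whether $X_1$ is discrete or continuous; the count $N$ itself still has to be a nonnegative integer.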

When $N$ and the $X_i$ all come from dice rolls, I get $EX_1=EN=\sum_{i=1}^6 i/6=7/2$, $VarX_1=VarN=\sum_{i=1}^6 i^2/6-(7/2)^2=91/6-49/4=35/12$, and $VarY=(49/4)(35/12)+(35/12)(7/2)=(35/12)(63/4)=735/16=45.9375$.
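
The arithmetic for the dice case can be checked with exact fractions. This is only a minimal Python sketch; the helper names `mean_var` and `random_sum_variance` are made up for illustration, and the code assumes, as the formula does, that the count is independent of the summands.

```python
from fractions import Fraction

def mean_var(pmf):
    """Return (mean, variance) of a finite distribution given as {value: probability}."""
    mean = sum(p * x for x, p in pmf.items())
    second_moment = sum(p * x * x for x, p in pmf.items())
    return mean, second_moment - mean ** 2

def random_sum_variance(count_pmf, summand_pmf):
    """Var(Y) for Y = X_1 + ... + X_N, with N independent of the iid X_i."""
    e_n, var_n = mean_var(count_pmf)
    e_x, var_x = mean_var(summand_pmf)
    # (E X_1)^2 Var(N) + Var(X_1) E N
    return e_x ** 2 * var_n + var_x * e_n

# A fair six-sided die, used both for the count N and for each summand X_i.
die = {face: Fraction(1, 6) for face in range(1, 7)}

var_y = random_sum_variance(die, die)
print(mean_var(die))         # mean and variance of a single die
print(var_y, float(var_y))   # variance of the random sum
```

Passing a different `count_pmf` or `summand_pmf` gives the variance for any other pair of finite distributions.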

2
On

The variance of a single die roll is $\frac{35}{12}$. The outcomes of the dice are independent, so the variance of the sum of $n$ dice rolls is $n\cdot \frac{35}{12}$. You can treat each roll as a random variable and apply the central limit theorem to the sum of $n$ rolls: for a sufficiently large number of rolls, the sum is approximately normally distributed as $\mathcal N(\mu, \ \sigma ^2)=\mathcal N(3.5\cdot n, \ n\cdot \frac{35}{12})$, where $3.5$ is the expected value of a single roll.

Let $X$ be the random variable for the sum of $100$ dice rolls. Since it is hard to calculate the probability of a specific sum directly, we approximate the distribution of $X$ by a normal distribution.

Let's calculate the probability that the sum of $100$ dice rolls is less than or equal to $370$. The expected value is $350$ and the variance is $100\cdot \frac{35}{12}$. Therefore

$$P(X\leq 370)\approx\Phi\left( \frac{x-n\cdot \mu}{\sqrt{n\cdot \sigma^2}} \right)=\Phi\left( \frac{370-100\cdot 3.5}{\sqrt{100\cdot \frac{35}{12}}} \right)\approx\Phi(1.1711)$$

$\Phi(\cdot)$ is the cdf of the standard normal distribution, which has expected value $0$ and variance $1$; in the formula above, $\mu=3.5$ and $\sigma^2=\frac{35}{12}$ are the mean and variance of a single roll. To get the value you can use a calculator or a table: $\Phi(1.1711)\approx 0.8792$.
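
For comparison, the exact probability can be computed by convolving the single-die distribution 100 times. The following is a minimal Python sketch, not part of the original answer; `dice_sum_pmf` and `phi` are illustrative helper names, and the continuity-corrected value (using $370.5$ instead of $370$) is an extra refinement that the answer above does not apply.

```python
import math

def dice_sum_pmf(n_dice, faces=6):
    """Exact pmf of the sum of n_dice fair dice, as a list indexed by the sum."""
    pmf = [1.0]  # before any die is rolled the sum is 0 with probability 1
    for _ in range(n_dice):
        new = [0.0] * (len(pmf) + faces)
        for s, p in enumerate(pmf):
            if p:
                for face in range(1, faces + 1):
                    new[s + face] += p / faces
        pmf = new
    return pmf

def phi(z):
    """Cdf of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, mu, var = 100, 3.5, 35 / 12       # number of rolls, per-roll mean and variance
sd_sum = math.sqrt(n * var)          # standard deviation of the sum

exact = sum(dice_sum_pmf(n)[: 370 + 1])        # P(X <= 370), exact up to rounding
plain = phi((370 - n * mu) / sd_sum)           # normal approximation as above
corrected = phi((370.5 - n * mu) / sd_sum)     # with continuity correction

print(exact, plain, corrected)
```

Because $X$ only takes integer values, the continuity-corrected figure should sit noticeably closer to the exact one than the plain approximation does.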

I did some simulations in Excel. I made $4,000\times 100$ rolls of one die (the Excel function is RANDBETWEEN). For every 100 rolls I calculated the sum and checked whether it was less than or equal to 370. The proportion among the 4,000 repeated experiments was, for instance, $0.87875$. The $4,000$ experiments can themselves be repeated; here are some additional values: $0.888, 0.87925, 0.88875, 0.8935, 0.87625$.

Conclusion: the probability that the sum of 100 rolls is smaller than 371 is about $88\%$. With the approximation, it is necessary neither to run a simulation nor to find the exact distribution of $X$.
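
The same experiment is easy to reproduce outside a spreadsheet. Below is a rough Python sketch of the procedure just described, not the original Excel workbook; the function name `estimate` and its parameters are hypothetical, and `random.Random.randint(1, 6)` stands in for RANDBETWEEN.

```python
import random

def estimate(n_experiments=4000, n_dice=100, threshold=370, seed=None):
    """Fraction of experiments in which the sum of n_dice rolls is <= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        total = sum(rng.randint(1, 6) for _ in range(n_dice))
        hits += total <= threshold   # True counts as 1
    return hits / n_experiments

# Repeat the block of 4,000 experiments a few times, as in the text.
print([estimate(seed=s) for s in range(5)])
```

Each call uses 4,000 experiments, so individual estimates can be expected to scatter by roughly half a percentage point around the true probability.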

3
On

Comment:

The very nice Answer by @snarfblatt gives both the derivation and the numerical answer. Sometimes this general type of problem is known as a 'random sum of random variables'.

Here is a brief simulation in R, which may help you visualize the process. It is based on a million repetitions of your experiment. Answers should be accurate to two or three significant digits.

 m = 10^6;  y = numeric(m)   # number of experiments; storage for the totals
 for(i in 1:m) {
   n = sample(1:6, 1)        # first die: how many dice to roll
   y[i] = sum(sample(1:6, n, replace=TRUE)) }  # total of the n extra dice
 mean(y)
 ## 12.24234  # approx E(Y) = (3.5)^2 = 12.25
 sd(y)
 ## 6.774848  # approx SD(Y) = sqrt(45.9375) = 6.778
 mean(y < 7)
 ## 0.253512  # approx P(Y < 7)
 pnorm(6.5, 12.25, 6.778)
 ## 0.1981263 # useless normal approximation

A relative frequency histogram of the million simulated totals suggests its pdf, which is only 'partly normal'. (In my experience, assuming 'some kind of CLT' and trying to fit a normal curve is not a feasible way to get probabilities for such random sums.) The values 1 through 6 arise from getting a 1 on the initial die. The largest possible value 36 can result from getting a six on the first die and on all six resulting dice rolls (this occurred 4 times in a million experiments).

[Figure: relative frequency histogram of the million simulated totals]
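
The distribution that the histogram approximates can also be worked out exactly, since the total is a mixture over the first die of sums of 1 to 6 dice. The following Python sketch is an illustrative addition rather than part of the R session above; the names `add_die` and `pmf_y` are made up for this example.

```python
from fractions import Fraction

FACES = range(1, 7)
SIXTH = Fraction(1, 6)

def add_die(pmf):
    """Convolve a pmf {total: probability} with one more fair die."""
    out = {}
    for total, p in pmf.items():
        for face in FACES:
            out[total + face] = out.get(total + face, 0) + p * SIXTH
    return out

# Mix over the first die: with probability 1/6 each, sum n = 1..6 further dice.
pmf_y = {}
partial = {0: Fraction(1)}
for n in FACES:
    partial = add_die(partial)            # pmf of the sum of n dice
    for total, p in partial.items():
        pmf_y[total] = pmf_y.get(total, 0) + SIXTH * p

mean = sum(y * p for y, p in pmf_y.items())
var = sum(y * y * p for y, p in pmf_y.items()) - mean ** 2
p_below_7 = sum(p for y, p in pmf_y.items() if y < 7)

print(mean, var)                 # exact E(Y) and Var(Y)
print(float(p_below_7))          # exact P(Y < 7)
print(pmf_y[36])                 # exact probability of the maximum total
```

Plotting `pmf_y` against its keys gives the exact counterpart of the simulated histogram, and the printed values can be compared directly with the simulated `mean(y)`, `sd(y)^2` and `mean(y < 7)`.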