Sampling as adding random variables, especially binomial RVs

Is sampling equivalent to adding random variables? I'm a bit confused: the binomial distribution looks more and more like a normal distribution as $n$ increases, and we're generally told this follows from the CLT, but why exactly? We're not talking about multiple samples, so there is no sampling distribution to speak of. Also, when we take a sample with a large number of observations, we generally seem to converge to a normal distribution anyway. Does this mean that sampling is equivalent to summing the RVs in a sample?

Best answer

The central limit theorem says that the sum or mean of $n$ iid random variables (which therefore share the same mean and the same finite variance) is approximately normally distributed when $n$ is large; $n > 25$ or so is a common rule of thumb, though how large $n$ must be depends on the underlying distribution. A binomial random variable is the sum of $n$ independent Bernoulli random variables with the same parameter $p$, so it follows an approximately normal distribution for large $n$. You talk about "multiple samples." In this case, one observation of the binomial random variable is considered "one sample," because it already summarizes $n$ Bernoulli trials. The distribution of such observations is approximately normal. (See the histogram below for an approximation to the normal curve.) One observation of the binomial random variable falls into one of the bins in the histogram and can be considered a single sample.
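To make the "binomial is a sum of Bernoullis" point concrete, here is a minimal simulation sketch using only the Python standard library. The values of `n`, `p`, `reps` and the seed are illustrative assumptions, not taken from the answer above, and `binomial_draw` is a hypothetical helper name.

```python
import math
import random
import statistics

def binomial_draw(n, p, rng):
    """One observation of a Binomial(n, p): the sum of n Bernoulli(p) draws."""
    return sum(1 if rng.random() < p else 0 for _ in range(n))

rng = random.Random(0)        # fixed seed, chosen only for reproducibility
n, p, reps = 100, 0.3, 5_000  # illustrative values (assumptions)

# Each element of draws is ONE binomial observation, i.e. "one sample".
draws = [binomial_draw(n, p, rng) for _ in range(reps)]

# Theoretical mean and standard deviation of Binomial(n, p).
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

sample_mean = statistics.fmean(draws)
sample_sd = statistics.pstdev(draws)

# Fraction of observations within one standard deviation of the mean.
# For a normal distribution this is roughly 0.68.
within_one_sd = sum(abs(x - mu) <= sigma for x in draws) / reps

print(f"mean {sample_mean:.2f} vs theory {mu:.2f}")
print(f"sd   {sample_sd:.2f} vs theory {sigma:.2f}")
print(f"fraction within one sd: {within_one_sd:.3f}")
```

If the normal approximation is reasonable, the simulated mean and standard deviation should sit close to $np$ and $\sqrt{np(1-p)}$, and roughly two thirds of the observations should fall within one standard deviation of the mean.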

Notice that the CLT does not care what the distribution of the random variables in the sample is, as long as they are iid with a finite mean and variance. For example, the sampling distribution of the mean of 100 Bernoulli random variables (that is, a binomial random variable divided by the number of trials) and the sampling distribution of the mean of 100 uniform random variables have very similar shapes; once each is standardized by its own mean and standard deviation, they nearly coincide. The key is that, within each sample, the random variables have the same mean and variance.
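The Bernoulli versus uniform comparison can be checked with a short simulation. This is only a sketch under assumed parameters: Bernoulli(0.5) and Uniform(0, 1) are my choices, and `sample_means`, `standardize` and `frac_within` are hypothetical helper names.

```python
import random
import statistics

def sample_means(draw, n, reps, rng):
    """Return reps sample means, each computed from n iid draws."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

def standardize(values):
    """Center and scale values by their own mean and standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]

def frac_within(zs, c=1.96):
    """Fraction of standardized values in [-c, c]; about 0.95 for a normal."""
    return sum(abs(z) <= c for z in zs) / len(zs)

rng = random.Random(1)  # fixed seed, only for reproducibility
n, reps = 100, 3_000    # illustrative values (assumptions)

bern_means = sample_means(lambda r: 1.0 if r.random() < 0.5 else 0.0, n, reps, rng)
unif_means = sample_means(lambda r: r.random(), n, reps, rng)

# The raw spreads differ, because the two distributions have different variances.
bern_sd = statistics.pstdev(bern_means)
unif_sd = statistics.pstdev(unif_means)

# After standardizing, both behave much like a standard normal.
bern_frac = frac_within(standardize(bern_means))
unif_frac = frac_within(standardize(unif_means))

print(f"sd of Bernoulli means: {bern_sd:.4f}, sd of uniform means: {unif_sd:.4f}")
print(f"within 1.96 sd: Bernoulli {bern_frac:.3f}, uniform {unif_frac:.3f}")
```

The two sets of means have different spreads, since a Bernoulli(0.5) variable has variance $1/4$ and a Uniform(0, 1) variable has variance $1/12$, but after standardizing both should put roughly 95% of their mass within 1.96 standard deviations.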

[Figure: histograms of the sampling distributions of the mean of 100 Bernoulli and of 100 uniform random variables]

When sampling from a population, the observations are not really iid, but we have to treat them as if they were for the CLT to apply. In practice we typically sample without replacement, so if we took a sample of size $n$ from a population of size $n$, the sample mean (or sum) would have no variance at all: its distribution would not be normal but a point mass. So when we talk about sampling from a population, the implicit assumption is that the population is large enough relative to the sample that we can pretend we are taking iid draws with replacement, with mean $\mu$ and variance $\sigma^2$.
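The effect of sampling without replacement can be sketched numerically. The population below (the integers 1 to 50) is a made-up example, and the comparison uses the standard finite population correction, $\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}$, which the answer above does not state explicitly.

```python
import math
import random
import statistics

rng = random.Random(2)                     # fixed seed, only for reproducibility
population = list(range(1, 51))            # hypothetical population, N = 50
N = len(population)
sigma2 = statistics.pvariance(population)  # population variance

def sd_of_sample_mean(n, reps=20_000):
    """Simulated SD of the mean of n draws taken WITHOUT replacement."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.pstdev(means)

def theoretical_sd(n):
    """SD of the sample mean using the finite population correction."""
    return math.sqrt(sigma2 / n * (N - n) / (N - 1))

sd_small = sd_of_sample_mean(10)          # sample is a fifth of the population
sd_full = sd_of_sample_mean(N, reps=200)  # sample IS the population
iid_sd = math.sqrt(sigma2 / 10)           # what iid sampling would give for n = 10

print(f"n = 10: simulated {sd_small:.3f}, corrected theory {theoretical_sd(10):.3f}, iid {iid_sd:.3f}")
print(f"n = N : simulated {sd_full:.3f}")
```

With $n = N$ every sample is the whole population, so the sample mean never changes and its standard deviation is zero (the point mass described above). With $n$ small relative to $N$, the correction factor is close to 1 and the iid approximation becomes reasonable.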