Illustrating Normal Approximation to Binomial (CLT)

For a problem on the sampling distribution of a sample proportion (Bernoulli distribution - $\mathbb P(\mathrm{yellow ball}) = 0.6$ out of $10000$ balls, say), I get the discrete distribution below (LHS) with $\mu=0.6$ and $\sigma=0.15$. When I try to draw the equivalent normal pdf, it is scaled up way above the bars. However, when I scale up the discrete distribution by the sample size ($10$), the pdf and the discrete distribution match. So far I could not find any glitch in the code that could produce this. Is this the expected output? How do we justify it? Does this mean there is an inherent limitation in applying the normal approximation? Kindly explain.

I see that, once $\sigma$ is below about $0.4$, the constant factor $1/(\sigma\sqrt{2\pi})$ in the normal density is above $1$. So is the normal approximation no longer a fit when the discrete distribution has $\sigma \leq 0.4$? The sample size $n$ is $10$. If $np$ is at play, it only gets worse when I increase the sample size $n$, which reduces $\sigma$.

In case you still suspect a glitch in the code, here is the code.

There are 3 best solutions below

3
On

If you take a sample $\boldsymbol x = (x_1, \ldots, x_n)$ of size $n$ of independent and identically distributed observations from a Bernoulli distribution with probability parameter $p$, and compute the sample proportion $$\hat p = \frac{1}{n} \sum_{i=1}^n x_i,$$ then $\hat p$ is approximately normal with mean $\mu = p$ and variance $\sigma^2 = p(1-p)/n$. When creating a histogram of $N$ simulations of $\hat p$, you would compute the density of observed proportions; i.e., for each $k \in \{0, 1, \ldots, n\}$, you would compute the number $s_k$ of simulations for which $\hat p = k/n$, then plot a scaled histogram comprising the vertical bars $$\left\{\left(\frac{k}{n}, \frac{n s_k}{N}\right)\right\}_{k=0}^n.$$

When done in this fashion, the resulting histogram will overlay a normal distribution with the aforementioned mean and variance. Note that the height of the bar is $n s_k/N$, not $s_k/N$. This is where your error lies. The reason is that the height of the bar is not a probability but a probability density: each bar represents an interval of width $1/n$, so its height is the probability $s_k/N$ divided by that width. This should become obvious once you realize that when $\sigma$ is sufficiently small for a normal distribution (how small exactly?), there will be some value for which the corresponding density exceeds $1$.
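As a rough cross-check of the scaling argument, here is a minimal Python sketch using only the standard library. The function name `density_bars`, the seed, and the printed layout are illustrative assumptions, not part of the question's code. It simulates $N$ sample proportions, computes the bar heights $n s_k/N$, and prints them next to the normal density at the same points.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def density_bars(n, p, n_sims, seed=0):
    """Simulate n_sims sample proportions; return {k/n: n * s_k / n_sims}."""
    rng = random.Random(seed)
    counts = Counter(
        sum(rng.random() < p for _ in range(n))  # successes in one sample of size n
        for _ in range(n_sims)
    )
    # Each bar stands for an interval of width 1/n, so its density height is
    # (s_k / n_sims) / (1 / n) = n * s_k / n_sims.
    return {k / n: n * counts[k] / n_sims for k in range(n + 1)}


n, p, n_sims = 10, 0.6, 10_000
sigma = math.sqrt(p * (1 - p) / n)
normal = NormalDist(mu=p, sigma=sigma)
bars = density_bars(n, p, n_sims)

for x, height in bars.items():
    print(f"p_hat={x:.1f}  bar={height:.3f}  normal pdf={normal.pdf(x):.3f}")

# The peak of a normal density is 1 / (sigma * sqrt(2 * pi)); it exceeds 1
# whenever sigma is below 1 / sqrt(2 * pi).
print("peak density:", normal.pdf(p))
print("sigma threshold:", 1 / math.sqrt(2 * math.pi))
```

Dividing every bar height by $n$ recovers the raw relative frequencies $s_k/N$, which sum to $1$; those are the values that sit far below the curve.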

Plotting in Mathematica:

G[n_, nn_, p_] := Show[ListPlot[#/{1, nn/n} & /@ 
   Sort[Tally[RandomVariate[BinomialDistribution[n, p], nn]/n]], 
   Filling -> Axis, PlotRange -> {{0, 1}, Automatic}, AspectRatio -> 1], 
   Plot[PDF[NormalDistribution[p, Sqrt[p (1 - p)/n]], x], {x, 0, 1},
   PlotRange -> All]]

G[10, 10000, 0.6]

I make no claims as to the efficiency of the code.

9
On

It seems you are trying to approximate the distribution $\mathsf{Binom}(n = 10, p = .6),$ using $m = 10,000$ iterations. That distribution has mean $\mu = np = 10(.6) = 6,$ variance $\sigma^2 = np(1-p) = 2.4,$ and standard deviation $\sigma = \sqrt{2.4} = 1.5492.$ Then you want to compare a histogram of the simulation results with the approximating normal distribution with mean $\mu$ and variance $\sigma^2.$

If you want someone to critique your Python code, you should ask on a programming site. But I will do a similar simulation using R, and discuss how to make the histogram (nearly) match the normal density function. That part involves basic ideas of statistics and probability.

set.seed(1888);  m = 10000;  n = 10;  p = .6;  mu = 6;  sg=sqrt(2.4)
x = rbinom(m, n, p)      # simulate binomial realizations
cutp=(-1:10)+.5          # cutpoints for histogram
hdr = "Random Sample from BINOM(10,.6) with Normal Approx."
hist(x, br=cutp, prob=T, col="skyblue2", main=hdr)
 curve(dnorm(x, mu, sg), -1, 11, add=T, lwd=2, col="red")
 xx = 0:10; pdf.binom = dbinom(xx, n, p)
 points(xx, pdf.binom, pch=19)

In order to coordinate a histogram with a density curve, you need to realize that the fundamental principle of a histogram is area. In R, the parameter prob=T of the function hist makes a histogram whose bars have area summing to $1.$ I have chosen cutpoints for the histogram to be exactly $1$ unit apart; thus the bars have bases of width one unit, and the heights will turn out to be approximations of probabilities in the binomial distribution.

diff(pnorm(c(6.5, 7.5), mu, sg))
[1] 0.206982
dbinom(7, n, p)
[1] 0.2149908

Thus, when I plot the normal density curve (with area $1$ underneath), it will appear to run nearly through the tops of the histogram bars. I'm saying "nearly" here because ten thousand iterations isn't enough to approximate the binomial distribution exactly, within the resolution of the figure.

Also, the area under the normal curve between 6.5 and 7.5 is 0.2070, and the binomial probability is $P(X = 7) = 0.2150.$ Ordinarily, you can expect about two decimal place accuracy from a normal approximation to a binomial distribution. However, that only works for sufficiently large $n$ and $p$ sufficiently close to $1/2.$ (A common rule of thumb is that both $np$ and $n(1-p)$ must exceed $5,$ which is not quite achieved here, since $n(1-p) = 4.$)
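The two numbers from the R session can be reproduced without R. The following is a small Python sketch (standard library only; the variable names are my own) that computes the normal area over $(6.5, 7.5)$ and the exact binomial probability $P(X = 7)$, along with the two rule-of-thumb quantities.

```python
import math
from statistics import NormalDist

n, p = 10, 0.6
mu = n * p                          # binomial mean, 6
sigma = math.sqrt(n * p * (1 - p))  # binomial standard deviation, sqrt(2.4)
normal = NormalDist(mu, sigma)

# Area under the approximating normal curve above the bar centered at 7.
normal_area = normal.cdf(7.5) - normal.cdf(6.5)

# Exact binomial probability P(X = 7).
binom_prob = math.comb(n, 7) * p**7 * (1 - p) ** 3

print(f"normal area on (6.5, 7.5): {normal_area:.6f}")
print(f"binomial P(X = 7):         {binom_prob:.7f}")
print(f"np = {n * p:.1f}, n(1-p) = {n * (1 - p):.1f}")
```

The values should agree with the `pnorm` and `dbinom` output shown above to the printed precision.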

Finally, I plotted the exact binomial probabilities (black dots) to make it possible to assess how well the simulation performed in approximating the binomial PDF.

3
On

There is a simple graphical reason why the graphs sometimes do not line up: you are graphing two completely different things.

For the plot of the normal density, which is from a continuous distribution, the total area under the curve should be $1.$

But for the plot of a discrete density, the sum of the lengths of the bars should be $1.$

If every bar were one unit wide, then the total length of the bars would also be their total area. You can do this easily in the plot for your second distribution (the one with $\mu \approx 6$), because the bars are spaced $1$ unit apart. If you widen the bars so they are all one unit wide, each bar will just touch the ones on either side of it, and the total area of the bars will be equal to the sum of their lengths.

Try scaling up the distribution by yet another factor of $10$ so that the discrete outcomes are $0, 10, 20, \ldots, 100.$ I think you will find then that the discrete graph and the normal approximation are mismatched in height again, but this time the bars will be much higher than the curve.

In summary, you have one plot based on length and another based on area. These two plots will only line up when you have adjusted the width of the graph so that "length" just happens to coincide with "area." For any other width of the graph, you will get a mismatch between the heights of the bars and the height of the curve.
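To make the length-versus-area point concrete, here is a short Python sketch (standard library only; the choice of spacings and the `results` dictionary are illustrative assumptions). It compares, at the most likely outcome, the bar length (a probability), the bar length divided by the spacing between outcomes (a density), and the height of the approximating normal curve, for outcomes spaced $0.1$, $1$, and $10$ units apart.

```python
import math
from statistics import NormalDist

n, p = 10, 0.6
# Exact binomial probabilities P(X = k), k = 0..n.
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

k = 6  # most likely number of successes
results = {}
for spacing in (0.1, 1, 10):  # distance between neighboring discrete outcomes
    normal = NormalDist(spacing * n * p, spacing * math.sqrt(n * p * (1 - p)))
    bar_length = pmf[k]             # probability: unaffected by rescaling
    bar_density = pmf[k] / spacing  # probability per unit of width
    curve = normal.pdf(spacing * k)  # height of the normal curve at the mode
    results[spacing] = (bar_length, bar_density, curve)
    print(
        f"spacing={spacing:>4}: length={bar_length:.4f}  "
        f"density={bar_density:.4f}  curve={curve:.4f}"
    )
```

With spacing $0.1$ the curve sits far above the bar lengths, with spacing $10$ far below, and only with spacing $1$ do the two roughly coincide. The density column tracks the curve in all three cases.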


One way to do a normal approximation of a discrete distribution, by the way, is to divide the normal distribution into sections and assign each section to one of the discrete values. If you do this correctly, you will have two bar graphs, which can then be compared directly.
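One possible way to carry out this sectioning, sketched in Python with the standard library only: cut the normal distribution at the half-integers and give each outcome $k$ the area over $(k - 0.5, k + 0.5)$. Assigning the two tails to the end values $0$ and $n$ is my own assumption, made so that the sections sum to $1$; the helper name `normal_section` is likewise hypothetical.

```python
import math
from statistics import NormalDist

n, p = 10, 0.6
normal = NormalDist(n * p, math.sqrt(n * p * (1 - p)))


def normal_section(k):
    """Normal probability assigned to outcome k: area over (k - 0.5, k + 0.5).

    Assumption: the lower tail is folded into k = 0 and the upper tail into k = n.
    """
    lower = 0.0 if k == 0 else normal.cdf(k - 0.5)
    upper = 1.0 if k == n else normal.cdf(k + 0.5)
    return upper - lower


# Exact binomial probabilities and their sectioned-normal counterparts.
binom = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
approx = [normal_section(k) for k in range(n + 1)]

for k in range(n + 1):
    print(f"k={k:2d}  binomial={binom[k]:.4f}  normal section={approx[k]:.4f}")
```

Both lists are probabilities that sum to $1$, so they can be drawn as two bar graphs on the same axes and compared bar by bar.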