To understand the Central Limit Theorem, I am comparing a $Binomial(n,p)$ variable with a large $n$ to a normal variable with mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$.
In this book they plot it this way:
import random
import math
import matplotlib.pyplot as plt
from collections import Counter

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

def make_hist(p, n, num_points):
    data = [binomial(n, p) for _ in range(num_points)]

    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation:
    # continuity-corrected difference of two CDF values for each i
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
          for i in xs]
    plt.plot(xs, ys)
    plt.show()

make_hist(0.75, 100, 10000)
And it gives back:
I don't understand why I have to take the difference of two normal cumulative distribution function values, rather than use a single normal density function, to compare a binomial variable with a normal one.
When I plot it with the normal probability density function instead, the result is exactly the same:
def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)

xs = range(min(data), max(data) + 1)
ys = [normal_pdf(i, mu, sigma) for i in xs]
plt.plot(xs, ys)
Which gives back:
So why bother with these cumulative distribution functions? What proves they do the job?
The point is that the normal approximation to $P(X=k)$ for a Binomial($n,p$) random variable $X$, using the continuity correction, is $\Phi \left ( \frac{k+1/2-np}{\sqrt{np(1-p)}} \right ) - \Phi \left ( \frac{k-1/2-np}{\sqrt{np(1-p)}}\right )$ where $\Phi$ is the standard normal CDF. This is the probability that the normal approximant falls in an interval of length $1$ centered at $k$. This can be further approximated by $\frac{1}{\sqrt{np(1-p)}} \phi \left ( \frac{k-np}{\sqrt{np(1-p)}} \right )$ where $\phi$ is the standard normal PDF, by neglecting the variation of $\phi$ on this interval. (Effectively we are approximating the integral of $\phi$ on this interval using the midpoint rule.)
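For concreteness, here is a minimal sketch (assuming Python 3.8+ for `math.comb`; the normal helpers mirror the ones defined in your code) that evaluates the exact binomial PMF, the continuity-corrected CDF difference, and the PDF shortcut at a few values of $k$:

import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_pdf(x, mu=0, sigma=1):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def binomial_pmf(k, n, p):
    # exact P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.75
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

for k in (70, 75, 80):
    exact = binomial_pmf(k, n, p)
    # P(k - 1/2 < Y < k + 1/2) for Y ~ Normal(mu, sigma): the continuity correction
    cdf_diff = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)
    # midpoint rule: a width-1 interval times the density at its centre
    pdf_mid = normal_pdf(k, mu, sigma)
    print(k, exact, cdf_diff, pdf_mid)

For these parameters the three values should be close, with the CDF difference and the PDF shortcut nearly indistinguishable.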
In your example $\sqrt{np(1-p)}=\sqrt{75}/2$ which is a little bit more than $4$, so this second approximation is pretty good. (The first approximation is just decent, not great.)
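And a quick numerical check of the two approximation steps over the whole range of $k$ (same assumptions as the sketch above, with the helpers repeated so it runs on its own):

import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_pdf(x, mu=0, sigma=1):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.75
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

def cdf_diff(k):
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)

# worst-case error of the continuity-corrected normal approximation to the PMF
err_first = max(abs(binomial_pmf(k, n, p) - cdf_diff(k)) for k in range(n + 1))
# worst-case error of replacing the CDF difference by the density at k
err_second = max(abs(cdf_diff(k) - normal_pdf(k, mu, sigma)) for k in range(n + 1))
print("binomial PMF vs CDF difference:", err_first)
print("CDF difference vs PDF shortcut:", err_second)

The second error should come out noticeably smaller than the first, matching the remark above.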