Why subtract two normal cumulative distribution functions, rather than plot a single normal density, to compare a binomial with a normal variable?


In order to understand the Central Limit Theorem, I am comparing a $Binomial(n,p)$ variable with a large $n$ to a normal variable with mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$.
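For instance, with $n = 100$ and $p = 0.75$ as in the code below, $\mu = np = 75$ and $\sigma = \sqrt{100 \cdot 0.75 \cdot 0.25} = \sqrt{18.75} \approx 4.33$.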

In this book, they plot it this way:

import math
import random
from collections import Counter

import matplotlib.pyplot as plt

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

def make_hist(p, n, num_points):
    data = [binomial(n, p) for _ in range(num_points)]
    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))
    # use a line chart to show the normal approximation:
    # P(X = i) is approximated by the difference of two normal CDFs
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)
    plt.show()

make_hist(0.75, 100, 10000)

And it gives:

[Plot: bar chart of the binomial samples with the normal approximation curve overlaid]

I don't understand why I have to take the difference of two normal cumulative distribution functions, evaluated at $i + 0.5$ and $i - 0.5$, rather than use a single normal density function to compare a binomial with a normal variable.

When plotting it with the normal probability density function instead, the result is exactly the same. With

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)

the plotting lines inside make_hist become:

    xs = range(min(data), max(data) + 1)
    ys = [normal_pdf(i, mu, sigma) for i in xs]
    plt.plot(xs, ys)

Which gives:

[Plot: the same bar chart with the normal PDF curve overlaid, visually identical to the first figure]

So why bother with these cumulative distribution functions? What proves that they do the job?

1 Answer
The point is that the normal approximation to $P(X=k)$ for a Binomial($n,p$) random variable $X$, using the continuity correction, is
$$\Phi \left ( \frac{k+1/2-np}{\sqrt{np(1-p)}} \right ) - \Phi \left ( \frac{k-1/2-np}{\sqrt{np(1-p)}}\right ),$$
where $\Phi$ is the standard normal CDF. This is the probability that the normal approximant falls in an interval of length $1$ centered at $k$; since the binomial puts its mass only on the integers, these unit intervals partition the line with one integer apiece. This can be further approximated by
$$\frac{1}{\sqrt{np(1-p)}} \, \phi \left ( \frac{k-np}{\sqrt{np(1-p)}} \right ),$$
where $\phi$ is the standard normal PDF, by neglecting the variation of $\phi$ on this interval. (Effectively, we are approximating the integral of $\phi$ over this interval using the midpoint rule.)
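To see this numerically, here is a minimal sketch (not from the book; the helper binomial_pmf is my own, and math.comb requires Python 3.8+) comparing the exact binomial PMF with both approximations for the question's parameters:

import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def normal_pdf(x, mu=0, sigma=1):
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (math.sqrt(2 * math.pi) * sigma)

def binomial_pmf(k, n, p):
    # exact P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.75
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

for k in (70, 75, 80):
    exact = binomial_pmf(k, n, p)                                          # exact PMF
    cdf_diff = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)  # continuity correction
    pdf_mid = normal_pdf(k, mu, sigma)                                     # midpoint-rule approximation
    print(f"k={k}: exact={exact:.5f}, cdf diff={cdf_diff:.5f}, pdf={pdf_mid:.5f}")

Near the mode ($k = 75$) all three numbers agree closely, which is why the PDF plot looks identical to the CDF-difference plot at this sample size.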

In your example $\sqrt{np(1-p)}=\sqrt{75}/2 \approx 4.33$, which is a little more than $4$, so this second approximation is pretty good. (The first approximation, the continuity-corrected normal to the binomial, is just decent, not great.)