Comparing two normals analytically gives a different result from a NumPy simulation

"Comparing two normal distributions" explained how to find $P(B>A)$ for any $A \sim N(\mu_1,\sigma_1)$, $B \sim N(\mu_2,\sigma_2)$.

In my case, there are two binomials: $A \sim Bin(100,0.52)$ and $B \sim Bin(100,0.47)$.

My task is to find the probability that $B > A$, that is, that the number of successes in $B$ is greater than in $A$.

I used the Central Limit Theorem (in this case, each count is approximately distributed as $N(np,npq)$, where the second parameter is the variance), which should work for such sample sizes. The resulting distributions are $N_a(52,24.96)$ and $N_b(47,24.91)$.

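$P(B>A)$ in this case was calculated using $B-A \approx N(-5,50)$ (the variance $24.96+24.91=49.87$ rounded to $50$): $$P(B-A>0) \approx P\left(N(-5,50)>0\right) = P\left(N(0,1)>\frac{5}{\sqrt{50}}\right)$$

The upper-tail probability for $z = 5/\sqrt{50} \approx 0.707$ is $0.24$.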
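
For reference, the calculation above can be sketched in plain Python. This is only an illustration of the uncorrected normal approximation; the helper name `normal_tail` is my own, and the variance is left unrounded ($49.87$), so $z$ comes out near $0.708$ rather than $0.707$.

```python
import math


def normal_tail(z):
    """Upper-tail probability P(Z > z) for a standard normal Z (hypothetical helper)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


n = 100
p_a, p_b = 0.52, 0.47

# Moments of B - A for independent binomials A and B.
mean_diff = n * p_b - n * p_a                          # 47 - 52 = -5
var_diff = n * p_a * (1 - p_a) + n * p_b * (1 - p_b)   # 24.96 + 24.91

# P(B - A > 0) under the normal approximation, with no continuity correction.
z = (0 - mean_diff) / math.sqrt(var_diff)
p_normal = normal_tail(z)
print(f"z = {z:.3f}, P(B > A) is approximately {p_normal:.4f}")
```

The printed probability should be roughly $0.24$, matching the hand calculation.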

However, simulations I ran with NumPy show a different picture. The following code:

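import numpy as np

t = 10_000_000  # number of simulated pairs (A, B)
rng = np.random.default_rng()
a = rng.binomial(100, 0.52, t)  # successes in A
b = rng.binomial(100, 0.47, t)  # successes in B
print(np.mean(a < b))  # fraction of trials where B > A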

results in 0.2178. Repeated runs give the same result, even when the number of trials changes; differences appear only in the fourth significant digit.

Is this discrepancy explained by variance, numerical errors, or other factors? Are there any errors in my approach to the problem?

Best answer

You asked whether there is anything wrong with your approach, and there is not. The CLT gives only an approximation, so it will not reproduce the exact answer. I would say that an error of $|0.24 - 0.2178| = 0.0222$ is not optimal but acceptable.
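
To see how far the approximation is from the truth without relying on random sampling, one can sum the joint probability mass function directly. The following is a minimal sketch, assuming $A$ and $B$ are independent; the helper `binom_pmf` and the variable names are my own.

```python
from math import comb


def binom_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for X ~ Bin(n, p) (hypothetical helper)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n = 100
pmf_a = binom_pmf(n, 0.52)
pmf_b = binom_pmf(n, 0.47)

# tail_b[k] = P(B > k), accumulated from the top of the support downwards.
tail_b = [0.0] * (n + 1)
running = 0.0
for k in range(n, -1, -1):
    tail_b[k] = running   # mass strictly above k
    running += pmf_b[k]

# P(B > A) = sum over a of P(A = a) * P(B > a), using independence.
p_exact = sum(pmf_a[a] * tail_b[a] for a in range(n + 1))
print(f"exact P(B > A) = {p_exact:.4f}")
```

If the simulation is sound, this value should agree with it to about three decimal places, which would confirm that the gap to $0.24$ comes from the approximation and not from sampling noise.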

But is there any way we can improve the approximation (without doing exhaustive calculations)? Well, one source of error is that we are approximating a discrete distribution with a continuous one, so we can apply a "continuity correction" (https://en.wikipedia.org/wiki/Continuity_correction). The key observation is that, since $B-A$ is discrete and can only take integer values, we have $$P(B-A > 0) = P(B-A \geq 1) = P\left(B-A \geq \frac{1}{2}\right)$$ Again using the CLT approximation, with $Z \sim N(0,1)$, we get $$P(B-A > 0) = P\left(B-A \geq \frac{1}{2}\right) = P\left(\frac{B-A + 5}{\sqrt{24.96 + 24.91}} \geq \frac{5.5}{7.06}\right) \approx P(Z \geq 0.779) \approx 0.218$$ which is much closer to the result obtained from your simulation.
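
The effect of the correction can be checked numerically. This is a small sketch using only the standard library; `normal_tail` is a helper name of my own, and the moments ($-5$ and $24.96 + 24.91$) are taken from the question.

```python
import math


def normal_tail(z):
    """Upper-tail probability P(Z > z) for a standard normal Z (hypothetical helper)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


mean_diff = -5.0                        # E[B - A]
sd_diff = math.sqrt(24.96 + 24.91)      # standard deviation of B - A

# Without correction: threshold at 0.
z_plain = (0.0 - mean_diff) / sd_diff
# With continuity correction: threshold moved to 1/2.
z_corrected = (0.5 - mean_diff) / sd_diff

p_plain = normal_tail(z_plain)
p_corrected = normal_tail(z_corrected)
print(f"plain:     z = {z_plain:.3f}, P = {p_plain:.4f}")
print(f"corrected: z = {z_corrected:.3f}, P = {p_corrected:.4f}")
```

The corrected value should come out near $0.218$, against roughly $0.24$ without the correction.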