When is the standard deviation of the sample mean (of a normally distributed variable) NOT $\frac{\sigma}{\sqrt{n}}$?

89 Views Asked by At

I'm doing a bit of statistics in Python. My Python's pretty solid, but it's been almost a decade since I finished my maths degree, and I'm struggling a bit with the statistics.

I've got two data sets, $X \sim \mathcal{N}(20, 2^2)$ and $Q \sim \mathcal{N}(70, 7^2)$. I then want a weighted combination of these two sets, called $X'$, drawing 98% of the data from $X$ and the other 2% from $Q$. Using the formulae given here, we then have $X' \sim \mathcal{N}(\mu, \sigma^2)$, where:

$$ \mu = (0.98 \times 20) + (0.02 \times 70) = 21\\ \sigma^2 = (0.98 \times (2^2+20^2)) + (0.02 \times (7^2+70^2)) - 21^2 = 53.9 $$

So far so good! (And the results I'm getting from my Python program tie in with my calculations.) But here's where I'm stuck...

I want now to look at the distributions for the sample means, $\bar{X}$ and $\bar{X'}$. I was always taught that, if $A \sim \mathcal{N}(m, s^2)$, then $\bar{A} \sim \mathcal{N}(m, \frac{s^2}{n})$. Setting $n = 1000$, this rule works perfectly for $\bar{X}$: it predicts a mean of 20 and standard deviation of $\frac{2}{\sqrt{1000}} \approx 0.0632455532$. However, for $\bar{X'}$ it predicts a mean of $21$ (correct) but a standard deviation of $\sqrt{\frac{53.9}{1000}} \approx 0.23216373532$ (not correct; Python gives $\approx 0.07$).

Does anyone have any ideas as to what might be going on here?


Here's my Python for anyone who's interested:

from numpy.random import normal
from numpy import mean, std

X_bar = []
X_dash_bar = []
for _ in range(1000):
  X = list(normal(20, 2, 1000))
  X_dash = list(normal(20, 2, 980))+list(normal(70, 7, 20))
  X_bar.append(mean(X))
  X_dash_bar.append(mean(X_dash))

print("mean(X_bar) = "+str(mean(X_bar)))
print("std(X_bar) = "+str(std(X_bar)))
print("mean(X_dash_bar) = "+str(mean(X_dash_bar)))
print("std(X_dash_bar) = "+str(std(X_dash_bar)))

Which outputs:

mean(X_bar) = 20.00089792500475
std(X_bar) = 0.06364965544782437
mean(X_dash_bar) = 21.001692387965356
std(X_dash_bar) = 0.06971151733127197
1

There are 1 best solutions below

13
On BEST ANSWER

If I'm right in interpreting $X'$ as $YQ + (1-Y)X$ where $Y \sim \operatorname{Bernoulli}(p=2/100)$ is independent of $Q$ and $X$, then $X'$ is definitely not normal. But that is not the problem with your calculation. $X'$ does have mean $pE(Q)+(1-p)E(X) = 21$ and variance $p E(Q^2) + (1-p)E(X^2) - E(X')^2 = 53.9$. That means the standard deviation of the mean of 1000 draws from $X'$ is $\sqrt{53.9/1000}\approx 0.23$, and the sample standard deviation should be a decent approximation to this.

Indeed,

from scipy.stats import norm, bernoulli, tstd
import numpy as np

def draw_n(n):
    y = bernoulli.rvs(0.02, size=n)
    x = norm.rvs(loc=20, scale=2, size=n)
    q = norm.rvs(loc=70, scale=7, size=n)
    return y * q + (1 - y) * x

means = [np.mean(draw_n(1000)) for i in range(10000)]
print(np.mean(means))
print(tstd(means))

gives about 21 as the mean and about 0.23 as the SD, as expected.

After seeing your code, what you are calculating is observations of the random variable $Z = (1/1000)( X_1 + \cdots + X_{980} + Q_1 + \cdots + Q_{20})$ where $X_i \sim N(20,4)$ and $Q_i \sim N(70,49)$ are independent. $Z$ is normal, and the sample sd you get is an estimator of the sd of $Z$ which is $\sqrt{(1/1000)^2(980\times 4 + 20\times 49)}=0.07$, agreeing with your computations.

$Z$ is not equal to the mean of 1000 draws from $0.98X + 0.02Q$ (this mean would have variance $(0.98 ^2 \times 4 + 0.02^2 \times 49)/1000 \approx 0.0039$) : rather it's normal with mean $0.98 \times 20 + 0.02 \times 70$ and variance $$ \frac{ 980 \times 2^2 + 20 \times 7^2}{1000^2} = 0.0049$$