I'm doing a bit of statistics in Python. My Python's pretty solid, but it's been almost a decade since I finished my maths degree, and I'm struggling a bit with the statistics.
I've got two data sets, $X \sim \mathcal{N}(20, 2^2)$ and $Q \sim \mathcal{N}(70, 7^2)$. I then want a weighted combination of these two sets, called $X'$, drawing 98% of the data from $X$ and the other 2% from $Q$. Using the formulae given here, we then have $X' \sim \mathcal{N}(\mu, \sigma^2)$, where:
$$ \mu = (0.98 \times 20) + (0.02 \times 70) = 21\\ \sigma^2 = (0.98 \times (2^2+20^2)) + (0.02 \times (7^2+70^2)) - 21^2 = 53.9 $$
So far so good! (And the results I'm getting from my Python program tie in with my calculations.) But here's where I'm stuck...
I want now to look at the distributions for the sample means, $\bar{X}$ and $\bar{X'}$. I was always taught that, if $A \sim \mathcal{N}(m, s^2)$, then $\bar{A} \sim \mathcal{N}(m, \frac{s^2}{n})$. Setting $n = 1000$, this rule works perfectly for $\bar{X}$: it predicts a mean of 20 and standard deviation of $\frac{2}{\sqrt{1000}} \approx 0.0632455532$. However, for $\bar{X'}$ it predicts a mean of $21$ (correct) but a standard deviation of $\sqrt{\frac{53.9}{1000}} \approx 0.23216373532$ (not correct; Python gives $\approx 0.07$).
Does anyone have any ideas as to what might be going on here?
Here's my Python for anyone who's interested:
from numpy.random import normal
from numpy import mean, std
X_bar = []
X_dash_bar = []
for _ in range(1000):
X = list(normal(20, 2, 1000))
X_dash = list(normal(20, 2, 980))+list(normal(70, 7, 20))
X_bar.append(mean(X))
X_dash_bar.append(mean(X_dash))
print("mean(X_bar) = "+str(mean(X_bar)))
print("std(X_bar) = "+str(std(X_bar)))
print("mean(X_dash_bar) = "+str(mean(X_dash_bar)))
print("std(X_dash_bar) = "+str(std(X_dash_bar)))
Which outputs:
mean(X_bar) = 20.00089792500475
std(X_bar) = 0.06364965544782437
mean(X_dash_bar) = 21.001692387965356
std(X_dash_bar) = 0.06971151733127197
If I'm right in interpreting $X'$ as $YQ + (1-Y)X$ where $Y \sim \operatorname{Bernoulli}(p=2/100)$ is independent of $Q$ and $X$, then $X'$ is definitely not normal. But that is not the problem with your calculation. $X'$ does have mean $pE(Q)+(1-p)E(X) = 21$ and variance $p E(Q^2) + (1-p)E(X^2) - E(X')^2 = 53.9$. That means the standard deviation of the mean of 1000 draws from $X'$ is $\sqrt{53.9/1000}\approx 0.23$, and the sample standard deviation should be a decent approximation to this.
Indeed,
gives about 21 as the mean and about 0.23 as the SD, as expected.
After seeing your code, what you are calculating is observations of the random variable $Z = (1/1000)( X_1 + \cdots + X_{980} + Q_1 + \cdots + Q_{20})$ where $X_i \sim N(20,4)$ and $Q_i \sim N(70,49)$ are independent. $Z$ is normal, and the sample sd you get is an estimator of the sd of $Z$ which is $\sqrt{(1/1000)^2(980\times 4 + 20\times 49)}=0.07$, agreeing with your computations.
$Z$ is not equal to the mean of 1000 draws from $0.98X + 0.02Q$ (this mean would have variance $(0.98 ^2 \times 4 + 0.02^2 \times 49)/1000 \approx 0.0039$) : rather it's normal with mean $0.98 \times 20 + 0.02 \times 70$ and variance $$ \frac{ 980 \times 2^2 + 20 \times 7^2}{1000^2} = 0.0049$$