Sampling distribution for mean difference from two independent Bernoulli populations


Let's assume that we have two independent Bernoulli populations, $\mathrm{Ber}(\theta_1)$ and $\mathrm{Ber}(\theta_2)$.

How do we prove that $\frac{(\bar X_1-\bar X_2)-(\theta_1-\theta_2)}{\sqrt{\frac{\theta_1(1-\theta_1)}{n_1}+\frac{\theta_2(1-\theta_2)}{n_2}}}\rightarrow^d N(0,1)$?

Assume that $n_1\neq n_2$

Any help would be appreciated.

P.S.: I've also posted this on Cross Validated, but since it got no answer, I've decided to also post it here.


There are 2 best solutions below


If $n_1 = n_2 = n$, pair the observations and define $$ \bar{X}_1-\bar{X}_2 =\frac{1}{n}\sum_{i=1}^n(X_{1i}-X_{2i})=\frac{1}{n}\sum_{i=1}^nY_i, $$ where $Y_1,\dots,Y_n$ are i.i.d. random variables with $\mathbb{E}Y_i = \theta_1-\theta_2$ and $$ \mathrm{Var}(Y_i)=\theta_1(1-\theta_1)+\theta_2(1-\theta_2), $$ so that $\mathrm{Var}(\bar{X}_1-\bar{X}_2)=\frac{1}{n}\left(\theta_1(1-\theta_1)+\theta_2(1-\theta_2)\right)$. Now you can apply the CLT directly.
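A quick Monte Carlo check of this equal-sample-size argument (a sketch; the values of $\theta_1$, $\theta_2$, $n$, and the number of replications are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2, n, reps = 0.3, 0.6, 500, 10000

# Sample means of n Bernoulli draws per population, repeated reps times.
x1 = rng.binomial(1, theta1, size=(reps, n)).mean(axis=1)
x2 = rng.binomial(1, theta2, size=(reps, n)).mean(axis=1)

# Standardize the difference of sample means as in the question (n1 = n2 = n).
se = np.sqrt((theta1 * (1 - theta1) + theta2 * (1 - theta2)) / n)
z = (x1 - x2 - (theta1 - theta2)) / se

# Empirical mean and standard deviation should be close to 0 and 1.
print(round(z.mean(), 2), round(z.std(), 2))
```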

If $n_1 \neq n_2$, then for large enough $n_i$ $$ \bar{X}_i \sim^{approx.} N\!\left(\theta_i, \frac{\theta_i(1 - \theta_i)}{n_i}\right), $$ so you can use the fact that the difference of two independent normal random variables is normal with the desired parameters. For a more rigorous treatment you have to consider $\bar{X}_1 - \bar{X}_2$ for every pair $(n_1, n_2)$ and then take both $n_1$ and $n_2$ to $\infty$.


For $n_1 \neq n_2$ note that by the CLT (in particular, the de Moivre–Laplace theorem), as $n_i \to \infty$, $$ \sqrt{n_i}(\bar{X}_{n_i} - \theta_i) \xrightarrow{D} \sqrt{\theta_i(1-\theta_i)}\,Z_i, \quad i\in\{1,2\}, $$ where $Z_1, Z_2$ are independent standard normals (independent because the two samples are). Hence $$ \sqrt{n_1}(\bar{X}_{n_1} - \theta_1) - \sqrt{n_2}(\bar{X}_{n_2} - \theta_2) \xrightarrow{D} N\big(0,\ \theta_1(1-\theta_1) + \theta_2(1-\theta_2)\big), $$ and this weak convergence holds jointly as $n_1 \to \infty$ and $n_2\to \infty$. In the balanced case $n_1 = n_2 = n$ this yields $$ \frac{\sqrt{n} \big( (\bar{X}_{n_1} - \bar{X}_{n_2}) - (\theta_1 - \theta_2) \big)}{\sqrt{\theta_1(1-\theta_1) + \theta_2(1-\theta_2)}} \xrightarrow{D} N(0,1). $$ For any finite $n_i$ the distribution is only approximately normal, and for $n_i$ that are not large enough, or for an imbalanced design, the approximation is not that good. Analyzing the quality of this approximation for small $n_i$ requires much finer, non-asymptotic arguments than the ones I've used here.

EDIT:

To be precise about the requirements on $n_1$ and $n_2$, you should ensure that they grow at the same rate, in other words $$ \frac{n_1}{n_2} \to c, \quad c \in (0, \infty). $$
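The unbalanced case can be checked numerically as well: below is a sketch with a fixed ratio $n_1/n_2$, standardizing with the two-sample standard error from the question (sample sizes, $\theta_1$, $\theta_2$, and replication count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2, reps = 0.2, 0.7, 10000
n1, n2 = 400, 800  # fixed ratio n1/n2 = 1/2

x1 = rng.binomial(1, theta1, size=(reps, n1)).mean(axis=1)
x2 = rng.binomial(1, theta2, size=(reps, n2)).mean(axis=1)

# Two-sample standard error with unequal sample sizes.
se = np.sqrt(theta1 * (1 - theta1) / n1 + theta2 * (1 - theta2) / n2)
z = (x1 - x2 - (theta1 - theta2)) / se

# Empirical mean and standard deviation should be close to 0 and 1.
print(round(z.mean(), 2), round(z.std(), 2))
```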


The assumption $n_1\neq n_2$ implies that the limit in the CLT (if it applies) must be a double limit ($\lim_{n_1\to \infty ,n_2\to \infty } \cdots$). Recall that this is not the same as the iterated limits $\lim_{n_1\to \infty}\lim_{n_2\to \infty } \cdots$ and $\lim_{n_2\to \infty}\lim_{n_1\to \infty } \cdots$.

Asserting the validity of the CLT for the double limit is not trivial.

The paper "Necessary and Sufficient Condition for Asymptotic Standard Normality of the Two Sample Pivot" (Majumdar and Majumdar, 2010) cites a result from the book "Mukhopadhyay, N. (2000), Probability and Statistical Inference", which states that, for any two sequences of iid random variables with finite variances and independent of each other:

$$ \frac{\overline X_1 -\overline X_2 - (\mu_1 - \mu_2) }{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \tag{1}$$

converges in distribution to $\mathcal N(0,1)$ along any line $n_1/n_2 = \delta \in (0,\infty)$ as $n_1,n_2 \to \infty$. Notice that this still leaves open the convergence of the double limit.
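The general statement is not specific to Bernoulli samples, so the pivot $(1)$ can be checked by simulation for any pair of finite-variance populations along a line $n_1/n_2 = \delta$. A sketch using exponential populations (the choices of $\delta$, the scale parameters, and the sample sizes are arbitrary illustrations; for an exponential with scale $s$, the mean is $s$ and the variance is $s^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 10000
delta, n2 = 3, 300
n1 = delta * n2  # stay on the line n1/n2 = delta

# Two independent iid sequences with finite variances (exponentials here).
s1, s2 = 2.0, 5.0               # scales: mean = s, variance = s**2
x1 = rng.exponential(s1, size=(reps, n1)).mean(axis=1)
x2 = rng.exponential(s2, size=(reps, n2)).mean(axis=1)

# The two-sample pivot (1) with mu_i = s_i and sigma_i^2 = s_i**2.
se = np.sqrt(s1**2 / n1 + s2**2 / n2)
z = (x1 - x2 - (s1 - s2)) / se

# Empirical mean and standard deviation should be close to 0 and 1.
print(round(z.mean(), 2), round(z.std(), 2))
```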