Convergence in distribution of the two-sample $t$-test statistic


I would appreciate either one of the following, or both:

  • A source (if a book, with page numbers) where I can find this result proven
  • A proof of the result

The question at hand:

Let $X_{11}, \dots, X_{1n_1}$ be independent and identically distributed random variables with mean $\mu_1$ and variance $\sigma_1^2$, and let $X_{21}, \dots, X_{2n_2}$ be independent and identically distributed random variables with mean $\mu_2$ and variance $\sigma_2^2$. Assume $n_1 \neq n_2$ and $\sigma_1^2 \neq \sigma_2^2$.

Denote $\bar{X}_{1, n_1} = \dfrac{1}{n_1}\sum_{i=1}^{n_1}X_{1i}$ and $\bar{X}_{2, n_2} = \dfrac{1}{n_2}\sum_{i=1}^{n_2}X_{2i}$. Also, let $S_1^2 = \dfrac{1}{n_1 - 1}\sum_{i=1}^{n_1}(X_{1i} - \bar{X}_{1, n_1})^2$ and $S_2^2 = \dfrac{1}{n_2 - 1}\sum_{i=1}^{n_2}(X_{2i} - \bar{X}_{2, n_2})^2$.

As $n_1 \to \infty$ and $n_2 \to \infty$, does the statistic $$T = \dfrac{\bar{X}_{1, n_1} - \bar{X}_{2, n_2}}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$ converge in distribution to a random variable; and if so, with what distribution?

Context. The test statistic $T$ is the one arising in Welch's $t$-test. Conventional statistical wisdom holds that, regardless of the population distributions of the $X_{1i}$ and $X_{2j}$ (i.e., even when they are not normal), the Central Limit Theorem (CLT) justifies treating $T$ as approximately $\mathcal{N}(0, 1)$. I haven't seen a proof of this, and I am doubtful the classical CLT can be applied directly.

My Efforts. I demonstrated that for a single population, with obvious notational extensions, it holds that $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ converges in distribution to a random variable with an $\mathcal{N}(0, 1)$ distribution. However, this result cannot be used in this question.

For one thing, $\bar{X}_{1, n_1} - \bar{X}_{2, n_2}$ cannot be written as a single arithmetic mean as in the result above. For another, while $S_1^2 \to \sigma_1^2$ and $S_2^2 \to \sigma_2^2$ in probability, since $\sigma_1^2 \neq \sigma_2^2$ they cannot be "factored out" as in my demonstration, which prohibits a direct application of the CLT.
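For what it's worth, a quick Monte Carlo check is consistent with the $\mathcal{N}(0, 1)$ claim (a sanity check, not a proof; the particular distributions, sample sizes, and replication count below are my own choices): with two non-normal populations having equal means but unequal variances, and $n_1 \neq n_2$, the simulated values of $T$ have mean close to $0$ and standard deviation close to $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, reps = 400, 250, 10000   # unequal sample sizes, many replications

# Equal means (the null hypothesis), unequal variances, both non-normal:
# X1 ~ Exponential(1): mean 1, variance 1
# X2 ~ Uniform(0, 2):  mean 1, variance 1/3
x1 = rng.exponential(1.0, size=(reps, n1))
x2 = rng.uniform(0.0, 2.0, size=(reps, n2))

xbar1, xbar2 = x1.mean(axis=1), x2.mean(axis=1)
s1sq = x1.var(axis=1, ddof=1)    # unbiased sample variances
s2sq = x2.var(axis=1, ddof=1)

# Welch statistic, one value per replication
t = (xbar1 - xbar2) / np.sqrt(s1sq / n1 + s2sq / n2)

print(f"mean = {t.mean():.3f}, sd = {t.std():.3f}")  # both should be near 0 and 1
```

A histogram of `t` against the standard normal density makes the same point visually; of course, a simulation at fixed $n_1, n_2$ says nothing about the limit, which is what the question asks for.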

There are 2 solutions below.

Solution 1.

Additional assumption: $\frac{n_{2}}{n_{1}}\to c$ for some finite constant $c>0$, as $n_{1}\wedge n_{2}\to \infty$.

Denote $n\equiv n_{1}\vee n_{2}$ and construct the following triangular array:

$$\begin{matrix}Y_{1,1}\\Y_{2,1}&Y_{2,2}\\Y_{3,1}&Y_{3,2}&Y_{3,3}\\ \cdots&\cdots&\cdots&\cdots\\ Y_{n,1}&Y_{n,2}&Y_{n,3}&\cdots&Y_{n,n}\\\cdots&\cdots&\cdots&\cdots&\cdots&\cdots\end{matrix}$$

with $Y_{n,i}\equiv \frac{\sqrt{n_{2}}}{n_{1}}X_{1,i}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}X_{2,i}1_{\{i\leq n_{2}\}}$. Then it remains to prove $$\frac{\sum_{i=1}^{n}Y_{n,i}}{\sqrt{\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}}}\to_{d}\mathcal{N}(0,1) \quad\text{Under }\mathbb{H}_{0}:\mu_{1}=\mu_{2}.$$

By construction the $Y_{n,i}$ are row-wise independent, and we have $$\mathbb{E}[Y_{n,i}]=\frac{\sqrt{n_{2}}}{n_{1}}\mu_{1}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}\mu_{2}1_{\{i\leq n_{2}\}},$$ and $$\mathrm{Var}(Y_{n,i})=\frac{n_{2}}{n_{1}^{2}}\sigma_{1}^{2}1_{\{i\leq n_{1}\}}+\frac{1}{n_{2}}\sigma_{2}^{2}1_{\{i\leq n_{2}\}}.$$ This gives $$\sum_{i=1}^{n}\mathbb{E}[Y_{n,i}]=\sqrt{n_{2}}\mu_{1}-\sqrt{n_{2}}\mu_{2}=0\quad \text{(under the null)},$$ and $$\mathrm{Var}\biggl(\sum_{i=1}^{n}Y_{n,i}\biggr)=\sum_{i=1}^{n}\mathrm{Var}(Y_{n,i})=\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}.$$
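For completeness, here is how the displayed limit transfers to the original statistic $T$ (this step is implicit in the solution). Summing the array entries recovers the numerator of $T$: $$\sum_{i=1}^{n}Y_{n,i}=\frac{\sqrt{n_{2}}}{n_{1}}\sum_{i=1}^{n_{1}}X_{1,i}-\frac{1}{\sqrt{n_{2}}}\sum_{i=1}^{n_{2}}X_{2,i}=\sqrt{n_{2}}\,\bigl(\bar{X}_{1,n_{1}}-\bar{X}_{2,n_{2}}\bigr),$$ so that, under the null, $$T=\frac{\sum_{i=1}^{n}Y_{n,i}}{\sqrt{\frac{n_{2}}{n_{1}}S_{1}^{2}+S_{2}^{2}}} =\frac{\sum_{i=1}^{n}Y_{n,i}}{\sqrt{\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}}}\cdot \sqrt{\frac{\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}}{\frac{n_{2}}{n_{1}}S_{1}^{2}+S_{2}^{2}}},$$ and since $S_{j}^{2}\to_{p}\sigma_{j}^{2}$ and $n_{2}/n_{1}\to c$, the second factor tends to $1$ in probability. Slutsky's theorem then gives $T\to_{d}\mathcal{N}(0,1)$.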

The desired convergence is guaranteed by the following triangular-array CLT.

Lindeberg-Feller Theorem: Let $\{Y_{n,i}\}$ be a row-wise independent triangular array of random variables with $\sum_{i=1}^{n}\mathbb{E}[Y_{n,i}]=0$ and $\sigma_{n}^{2}\equiv\sum_{i=1}^{n}\sigma_{n,i}^{2}$, where $\sigma_{n,i}^{2}\equiv\mathrm{Var}(Y_{n,i})$. Let $Z_{n}\equiv\sum_{i=1}^{n}Y_{n,i}$; then $Z_{n}/\sigma_{n}\to_{d}\mathcal{N}(0,1)$ provided the Lindeberg condition holds: $$\frac{1}{\sigma_{n}^{2}}\sum_{i=1}^{n}\mathbb{E}[Y_{n,i}^{2}1_{\{|Y_{n,i}|>\varepsilon\sigma_{n}\}}]\to 0,\quad \text{for every }\varepsilon>0.$$ (Strictly, the standard statement requires $\mathbb{E}[Y_{n,i}]=0$ for each entry; here one applies it to the centered array $Y_{n,i}-\mathbb{E}[Y_{n,i}]$, whose row sum coincides with $\sum_{i}Y_{n,i}$ under the null, since $\sum_{i}\mathbb{E}[Y_{n,i}]=0$.)

Note that $$ \begin{align*} Y_{n,i}^{2}&=\Bigl(\frac{\sqrt{n_{2}}}{n_{1}}X_{1,i}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}X_{2,i}1_{\{i\leq n_{2}\}}\Bigr)^{2}\\ &\leq 2\Bigl(\frac{n_{2}}{n_{1}^{2}}X_{1,i}^{2}1_{\{i\leq n_{1}\}}+\frac{1}{n_{2}}X_{2,i}^{2}1_{\{i\leq n_{2}\}}\Bigr). \end{align*}$$ Separating the two terms in the summation and applying the dominated convergence theorem, the Lindeberg condition is readily verified.
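A sketch of that verification, under the stated assumptions (working with the uncentered entries; centering changes only constants): on the event $\{|Y_{n,i}|>\varepsilon\sigma_{n}\}$, at least one of $|X_{1,i}|>t_{1,n}\equiv\frac{\varepsilon\sigma_{n}n_{1}}{2\sqrt{n_{2}}}$ or $|X_{2,i}|>t_{2,n}\equiv\frac{\varepsilon\sigma_{n}\sqrt{n_{2}}}{2}$ must hold, and both thresholds diverge because $n_{2}/n_{1}\to c$ keeps $\sigma_{n}$ bounded away from $0$ while $n_{1}/\sqrt{n_{2}}\to\infty$ and $\sqrt{n_{2}}\to\infty$. Writing $B_{n,i}\equiv\{|X_{1,i}|>t_{1,n}\}\cup\{|X_{2,i}|>t_{2,n}\}$, the bound on $Y_{n,i}^{2}$ and the identical distribution across $i$ give $$\frac{1}{\sigma_{n}^{2}}\sum_{i=1}^{n}\mathbb{E}\bigl[Y_{n,i}^{2}1_{\{|Y_{n,i}|>\varepsilon\sigma_{n}\}}\bigr] \leq \frac{2}{\sigma_{n}^{2}}\Bigl(\frac{n_{2}}{n_{1}}\mathbb{E}\bigl[X_{1,1}^{2}1_{B_{n,1}}\bigr]+\mathbb{E}\bigl[X_{2,1}^{2}1_{B_{n,1}}\bigr]\Bigr),$$ and, by the independence of $X_{1,1}$ and $X_{2,1}$, $$\mathbb{E}\bigl[X_{1,1}^{2}1_{B_{n,1}}\bigr]\leq \mathbb{E}\bigl[X_{1,1}^{2}1_{\{|X_{1,1}|>t_{1,n}\}}\bigr]+\mathbb{E}[X_{1,1}^{2}]\,\mathbb{P}(|X_{2,1}|>t_{2,n})\to 0$$ by dominated convergence and Chebyshev's inequality; the $X_{2}$ term is handled symmetrically.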

Solution 2.

To derive the asymptotic distribution, we use the same notation as the solution above. Write \begin{align*} T&=\frac{\bar{X}_{1,n_1}-\bar{X}_{2,n_2}}{\sqrt{\dfrac{S_{1,n_1}^2}{n_1}+ \dfrac{S_{2,n_2}^2}{n_2}}}\\ &= \sqrt{\frac{S_{1,n_1}^2/n_1}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \frac{\bar{X}_{1,n_1}-\mu_1}{\sqrt{S_{1,n_1}^2/n_1}} \\ &\qquad - \sqrt{\frac{S_{2,n_2}^2/n_2}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \frac{\bar{X}_{2,n_2}-\mu_2}{\sqrt{S_{2,n_2}^2/n_2}}+\mu\\ &=\alpha_{1,n_1}\bar{Y}_{1,n_1}+\alpha_{2,n_2}\bar{Y}_{2,n_2}+\mu\\ &=\boldsymbol{\alpha}^\top_n\cdot \overline{\boldsymbol{Y}}_n+\mu,\quad (n=(n_1,n_2)), \end{align*} where \begin{align*} \boldsymbol{\alpha}^\top_n &=(\alpha_{1,n_1},\alpha_{2,n_2}) =\Bigg(\sqrt{\frac{S_{1,n_1}^2/n_1}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}}, -\sqrt{\frac{S_{2,n_2}^2/n_2}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \Bigg),\\ \mu &=\frac{\mu_1-\mu_2}{\sqrt{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}}, \\ \overline{\boldsymbol{Y}}_n^\top&= \Big(\bar{Y}_{1,n_1} , \bar{Y}_{2,n_2}\Big) = \Bigg(\frac{\bar{X}_{1,n_1}-\mu_1}{\sqrt{S_{1,n_1}^2/n_1}} , \frac{\bar{X}_{2,n_2}-\mu_2}{\sqrt{S_{2,n_2}^2/n_2}} \Bigg). \end{align*} (Note that the sign of the second term is carried by $\alpha_{2,n_2}$, so $\bar{Y}_{2,n_2}$ itself is defined without a minus sign.)

The following facts are standard: as $n\to\infty$ (i.e., $n_1\wedge n_2\to\infty$), \begin{gather*} \frac{S^2_{1,n_1}}{\sigma_1^2}\stackrel{\text{a.s.}}{\longrightarrow}1,\qquad \frac{S^2_{2,n_2}}{\sigma_2^2}\stackrel{\text{a.s.}}{\longrightarrow}1,\\ \frac{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}{\sigma_{1}^2/n_1+\sigma_{2}^2/n_2} \stackrel{\text{a.s.}}{\longrightarrow}1, \tag{1}\\ \overline{\boldsymbol{Y}}_n \stackrel{\text{dist}}{\longrightarrow}N(\boldsymbol{0},\boldsymbol{I}_2).\tag{2} \end{gather*} Denote \begin{equation*} \boldsymbol{a}_n^\top=(a_{1,n_1},a_{2,n_2}) =\Bigg(\sqrt{\frac{\sigma_1^2/n_1}{\sigma_1^2/n_1+\sigma_2^2/n_2}}, -\sqrt{\frac{\sigma_2^2/n_2}{\sigma_1^2/n_1+\sigma_2^2/n_2}}\Bigg); \end{equation*} then $\|\boldsymbol{a}_n\|=1$ and, by (1), \begin{gather*} \|\boldsymbol{\alpha}_n-\boldsymbol{a}_n\|\stackrel{\mathsf{P}}{\longrightarrow}0,\\ (\boldsymbol{\alpha}_n-\boldsymbol{a}_n)^\top\cdot \overline{\boldsymbol{Y}}_n \stackrel{\mathsf{P}}{\longrightarrow}0.\tag{3} \end{gather*} From (2), the sequence of distributions of $\{\overline{\boldsymbol{Y}}_n, n\ge 1\}$ is tight; hence, since $\|\boldsymbol{a}_n\|=1$, the sequence of distributions of $\{\boldsymbol{a}^\top_n \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$ is tight too. Now we prove that the distributions of $\{\boldsymbol{a}^\top_n \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$ have a unique limit point.

If $\{\boldsymbol{a}^\top_{n'}\cdot\overline{\boldsymbol{Y}}_{n'}\}$ is a subsequence converging in distribution, we may suppose, without loss of generality, that $\boldsymbol{a}_{n'}\to \boldsymbol{a}$ with $\|\boldsymbol{a}\|=1$ (otherwise, pass to a further convergent sub-subsequence, which exists because $\|\boldsymbol{a}_n\|=1$). Hence from (2) we have \begin{equation*} \boldsymbol{a}^\top_{n'}\cdot \overline{\boldsymbol{Y}}_{n'} \stackrel{\text{dist}}{\longrightarrow} N(0,1). \tag{4} \end{equation*} (4) means that $N(0,1)$ is the unique limit point of the distributions of $\{\boldsymbol{a}^\top_n \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$, so $\boldsymbol{a}^\top_n\cdot\overline{\boldsymbol{Y}}_n\stackrel{\text{dist}}{\longrightarrow}N(0,1)$. Combining this with (3) and Slutsky's theorem, when $\mu_1=\mu_2$ (so that $\mu=0$),
\begin{equation*} T=\boldsymbol{\alpha}^\top_n\cdot \overline{\boldsymbol{Y}}_n+\mu \stackrel{\text{dist}}{\longrightarrow} N(0,1). \tag{5} \end{equation*}

In summary: if $\mu_1\ne\mu_2$, then, as $n\to\infty$, \begin{equation*} |T|\stackrel{\text{a.s.}}{\longrightarrow}+\infty; \end{equation*} if $\mu_1=\mu_2$, then (5) holds as $n\to\infty$.
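A quick justification of the divergence claim, in the same notation: by the strong law of large numbers, $\bar{X}_{1,n_1}-\bar{X}_{2,n_2}\to\mu_1-\mu_2\neq 0$ almost surely, while $S_{1,n_1}^{2}/n_{1}+S_{2,n_2}^{2}/n_{2}\to 0$ almost surely, so \begin{equation*} |T|=\frac{|\bar{X}_{1,n_1}-\bar{X}_{2,n_2}|}{\sqrt{S_{1,n_1}^{2}/n_{1}+S_{2,n_2}^{2}/n_{2}}} \stackrel{\text{a.s.}}{\longrightarrow}+\infty. \end{equation*}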