Question about applying $t$-test if the population is not normally distributed


A theorem from 'Introduction to Mathematical Statistics' by Hogg et al.:

Theorem 3.6.1. Let $X_1, \ldots, X_n$ be iid random variables each having a normal distribution with mean $\mu$ and variance $\sigma^2$. Define the random variables $\bar X = \frac{1}{n} \sum_{i=1}^n X_i$ and $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2$. Then

  • $\bar X$ has a $N(\mu, \frac{\sigma^2}{n})$ distribution.
  • $\bar X$ and $S^2$ are independent.
  • $(n - 1)S^2/\sigma^2$ has a $\chi^2(n - 1)$ distribution.
  • The random variable $T = \frac{\bar X - \mu}{S/\sqrt{n}}$ has a Student t-distribution with $n - 1$ degrees of freedom.

This theorem is the basis for the $t$-test, as far as I understand. Another source ('Statistics for Managers Using Microsoft Excel' by Levine et al.) has this remark:

If the population is not normally distributed, you can still use the $t$ test if the population is not too skewed and the sample size is not too small.

This diagram says something similar to Levine et al.: if the sample size is $\ge 30$ and the population standard deviation is unknown, one should use the $t$-test.

So, if we want to use the $t$-test, Theorem 3.6.1 should hold. Question: How does the fact that the sample size is not too small let us apply Theorem 3.6.1? One of the hypotheses of the theorem is that $X_1, \dots, X_n$ are normal random variables. But if we only know that the sample size is large (i.e., $n$ is large), this doesn't tell us anything about each $X_i$ being normal. I was thinking about applying CLT (see below), but its conclusion is about $\bar{X}$, not about individual $X_i$s.

Theorem 4.2.1 (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n$ denote the observations of a random sample from a distribution that has mean $\mu$ and finite variance $\sigma^2$. Then the distribution function of the random variable $W_n = \frac{\bar X - \mu}{\sigma/\sqrt{n}}$ converges to $\Phi$, the distribution function of the $N(0, 1)$ distribution, as $n \to \infty$.

As we further show in Chapter 5, the result stays the same if we replace $\sigma$ by the sample standard deviation $S$; that is, under the assumptions of Theorem 4.2.1, the distribution of $Z_n = \frac{\bar X - \mu}{S/\sqrt{n}}$ is approximately $N(0, 1)$.
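To make the remark about $Z_n$ concrete, here is a minimal simulation sketch. It is not from either book: the choice of an exponential population with mean $1$, the sample sizes $5$ and $200$, the replication count, and the helper names `studentized_mean` and `tail_rate` are all assumptions made for illustration. It estimates $P(|Z_n| > 1.96)$, which would be about $0.05$ if $Z_n$ were exactly $N(0, 1)$.

```python
import math
import random

def studentized_mean(sample, mu):
    """Return (x_bar - mu) / (S / sqrt(n)), with S the sample standard deviation."""
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
    return (x_bar - mu) / math.sqrt(s2 / n)

def tail_rate(n, reps, rng, cutoff=1.96):
    """Estimate P(|Z_n| > cutoff) for samples drawn from Exp(1), whose mean is 1."""
    hits = 0
    for _ in range(reps):
        # Assumption: a skewed, clearly non-normal population (exponential, rate 1).
        sample = [rng.expovariate(1.0) for _ in range(n)]
        if abs(studentized_mean(sample, mu=1.0)) > cutoff:
            hits += 1
    return hits / reps

rng = random.Random(0)  # fixed seed so the run is reproducible
rate_small = tail_rate(n=5, reps=5000, rng=rng)
rate_large = tail_rate(n=200, reps=5000, rng=rng)
print(f"n=5:   P(|Z_n| > 1.96) ~ {rate_small:.3f}")
print(f"n=200: P(|Z_n| > 1.96) ~ {rate_large:.3f}")
```

In a run like this, the estimate for the small sample should sit well above $0.05$, while the estimate for the large sample should be close to it, even though no individual $X_i$ is normal.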

There are 1 best solutions below


Formally, the $t$-test uses the test statistic
$$ T = \frac{\bar X - \mu_0}{Sd[\bar X]} = \frac{\bar X - \mu_0}{s/\sqrt{n}} $$
where $\mu_0$ is some reference value and $s$ is the sample standard deviation. The mathematical proof that the test statistic $T$ is $t_\nu$ distributed needs the assumptions you stated. In practice, however, this is rarely of interest: if we knew that the $X_i$ were all i.i.d. normal random variables with equal variance, we would probably know the mean of their distribution as well. Thus, if you wish to evaluate a "large" empirical dataset, you either accept the additional assumption that the dataset is large enough for the central limit theorem to be applicable, to a good enough approximation, or you use the Wilcoxon test instead. Also, keep in mind that the $\alpha$ risk (the probability of a type I error) is rather insensitive to the normality assumption, but the $\beta$ risk (the probability of a type II error) is not.
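The claim about the $\alpha$ risk can be checked roughly by simulation. The sketch below is only an illustration under assumptions of my own: $n = 30$, three example populations (standard normal, uniform on $[0, 1)$, exponential with mean $1$), a two-sided test at the nominal $5\%$ level, and the tabulated critical value $t_{0.975,\,29} \approx 2.045$. The names `t_statistic` and `rejection_rate` are hypothetical helpers. In each case the null hypothesis is true, so the rejection rate estimates the actual $\alpha$ risk. It does not address the $\beta$ risk.

```python
import math
import random

T_CRIT_29 = 2.045  # two-sided 5% critical value of the t distribution with 29 df

def t_statistic(sample, mu0):
    """One-sample t statistic (x_bar - mu0) / (s / sqrt(n))."""
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
    return (x_bar - mu0) / math.sqrt(s2 / n)

def rejection_rate(draw, mu0, n=30, reps=10000, seed=1):
    """Fraction of samples for which |T| exceeds the critical value when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        if abs(t_statistic(sample, mu0)) > T_CRIT_29:
            rejections += 1
    return rejections / reps

# Each entry: (sampler, true mean). Testing mu0 = true mean means H0 holds.
populations = {
    "normal": (lambda rng: rng.gauss(0.0, 1.0), 0.0),
    "uniform": (lambda rng: rng.random(), 0.5),
    "exponential": (lambda rng: rng.expovariate(1.0), 1.0),
}

rates = {name: rejection_rate(draw, mu0) for name, (draw, mu0) in populations.items()}
for name, rate in rates.items():
    print(f"{name:12s} estimated alpha risk ~ {rate:.3f}")
```

For the symmetric populations the estimate should stay close to the nominal $0.05$; for the skewed exponential population it typically drifts somewhat above it, which is in line with the "not too skewed" caveat quoted in the question.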