Assumption of Normality in Central Limit Theorem


Regarding the distributions of random variables, I understand the following points:

  • The distribution of the expected value (i.e. "mean") of any random variable is (asymptotically) normally distributed - this is stated within the Central Limit Theorem

  • The sums and differences of normally distributed random variables are also normally distributed

This being said, now consider the popular "T-Test". The T-Test can be used to determine whether the mean values of some random variable, estimated from two samples, are equal - in this case, we can consider this mean value as a random variable. And since we know that the difference in the mean values of any two random variables follows a normal distribution, the T-Test exploits this fact and uses the normal distribution to determine whether the difference in means between two samples is statistically significant.

Something I have never quite been able to understand: In a T-Test, we are told that we require the underlying distribution of both samples to be normally distributed - yet, the difference between the mean values of ANY two random variables is normally distributed.

My Question: Thus, why is the assumption of normality required in a T-Test when we know that the difference between the mean values of any two random variables is always normally distributed (provided there are enough observations)?

I can understand why this might not hold when we have very few observations in each sample - but when we have many observations in both samples, why is the assumption of normality still required for the T-Test?

Thanks!


There are 3 best solutions below

0
On

My understanding is that the normality assumption is not really required.

As you point out, the Central Limit Theorem helps here, but the real issue is how much it helps. This depends on how much you're violating the normality assumption and how much data you have. Maybe you have enough data---but maybe you don't.
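To make "maybe you have enough data---but maybe you don't" concrete, here is a small check (my own illustrative setup, not from the answer): run a nominal-5% one-sample t-test on skewed Exp(1) data with the null hypothesis true and see how far the actual rejection rate drifts from 5%.

```python
import numpy as np

# Null hypothesis is true (mu = 1 for Exp(1)), data are skewed.
# With n = 30, how close is the actual rejection rate to the nominal 5%?
rng = np.random.default_rng(0)
n, trials = 30, 20_000
tcrit = 2.045  # two-sided 5% critical value of the t distribution with 29 df

x = rng.exponential(scale=1.0, size=(trials, n))
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
rate = np.mean(np.abs(t) > tcrit)
print(f"actual rejection rate ~ {rate:.3f} (nominal 0.05)")
```

Rerunning with larger `n` shows the rate creeping back toward 0.05 - which is exactly the "how much data" question.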

Other discussions about this issue from CrossValidated: 1, 2, 3, 4, 5, 6.

0
On

The thing about asymptotic behaviour is that it describes what happens in the limit as $n \rightarrow \infty$; it says nothing about any particular finite $n$. Knowing that $\bar{X}$ is asymptotically normal as $n \rightarrow \infty$ doesn't mean you can pick a value of $n$ and declare "it's approximately normal now". If the initial distribution is "nice", then by $n = 30$ the middle of the distribution of $\bar{X}$ might be "normal enough" for most purposes, but you'll still have distinctly non-normal behaviour out in the tails; or the initial distribution may be wide and lumpy, in which case even at $n = 100$ the distribution of the mean still doesn't look anything like a normal distribution.
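A quick sketch of this point, using a unit-rate exponential as my choice of "not nice" distribution (its skewness is 2, and the skewness of the sample mean decays only like $2/\sqrt{n}$):

```python
import numpy as np

# Empirical skewness of the sample mean of n Exp(1) draws, for growing n.
# A normal distribution has skewness 0; the decay toward 0 is slow.
rng = np.random.default_rng(0)
skews = []
for n in [5, 30, 100, 1000]:
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    skews.append((z ** 3).mean())
    print(f"n={n:5d}  empirical skewness of mean ~ {skews[-1]:.3f}"
          f"  (theory: {2 / np.sqrt(n):.3f})")
```

Even at $n = 1000$ the residual skewness is nonzero, and it is precisely the tails - where tests reject - that feel it most.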

3
On

▶ The phrase

"The distribution of the expected value (i.e. "mean") of any random variable is (asymptotically) normally distributed - this is stated within the Central Limit Theorem".

is misleading, because the expected value $\mathbf{E}[X] = \int_{\mathbb{R}} x \, \mathrm{d}F_X(x)$ of a random variable is not a random variable, but a number. What is true is that:

The distribution of the sample mean $\overline{X}=\frac{X_1+X_2+\cdots+X_n}{n}$ of the i.i.d. samples $X_1, X_2, \ldots, X_n$, drawn from a distribution with finite variance, is asymptotically normal for sufficiently large $n$.

You really need to distinguish the notion of expected value (which can be thought of as the population mean in statistics) from the sample mean.
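The distinction is easy to see numerically (a tiny illustration of my own, using Exp(1)): the expected value is a fixed number, while the sample mean changes with every fresh sample.

```python
import numpy as np

# E[X] for an Exp(1) random variable is the *number* 1.0 - not random.
# The sample mean, by contrast, is itself a random variable: it comes out
# different every time a new sample of size 30 is drawn.
rng = np.random.default_rng(7)
expected_value = 1.0
sample_means = [rng.exponential(1.0, size=30).mean() for _ in range(5)]
print(sample_means)  # five different realizations scattered around 1.0
```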

▶ Now let's move on to the following phrase:

"In a T-Test, we are told that we require the underlying distribution of both samples to be normally distributed - yet, the difference between the mean values of ANY two random variables is normally distributed."

What do you mean in the last sentence? The difference $X - Y$ of any two random variables $X$ and $Y$ need not be normally distributed. $X - Y$ becomes normally distributed if $X$ and $Y$ are (jointly) normal.

▶ Finally, let's take a look at the OP's question.

"Thus, why is the assumption of normality required in a T-Test when we know that the difference between the mean values of any two random variables is always normally distributed (provided there are enough observations)?"

Note that the T-test is derived under the assumptions that

  1. the sample means follow normal distributions,
  2. the sample variance follows a scaled $\chi^2$ distribution, and
  3. the sample mean and sample variance are statistically independent.

A two-sample location test requires an additional assumption on the independence of two groups of samples.
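For concreteness, here is a sketch of how a two-sample statistic is actually computed - I use Welch's variant, which keeps the independence assumption but drops the equal-variance one (the function name and the test data are my own):

```python
import numpy as np

def welch_t(x, y):
    """Welch's two-sample t-statistic and its approximate degrees of freedom.

    Assumes the two groups are independent; unlike the pooled t-test,
    it does not assume equal variances.
    """
    nx, ny = len(x), len(y)
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    se2 = vx / nx + vy / ny                       # squared standard error
    t = (np.mean(x) - np.mean(y)) / np.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.5, 2.0, size=60)
t, df = welch_t(x, y)
print(f"t = {t:.2f}, df ~ {df:.1f}")
```

Under assumptions 1-3, $t$ follows a Student $t$ distribution with (approximately) `df` degrees of freedom; when the assumptions hold only approximately, so does that reference distribution.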

Even so, we can still try to apply the T-test in cases where these assumptions hold only approximately.

  • Item 1 does not cause much trouble, since the CLT tells us that the sample mean $\overline{X}$ is asymptotically normal when the sample is large and is drawn from a population distribution with finite variance. In fact, the CLT is so powerful for the sample mean that the sample size often only needs to be moderately large for $\overline{X}$ to be approximately normal.

  • Items 2 and 3 are not terribly bad either. If the sample size is large enough and the kurtosis of the population distribution is finite, then the sample variance $S^2$ is approximately normally distributed. (See this for a proof.)

    On the other hand, the convergence of $S^2$ to normality seems much slower than that of $\overline{X}$. Although I have no good way of formalizing this idea, numerical simulations with various choices of population distribution seem to support this phenomenon. In particular, the distribution of $S^2$ can deviate significantly from the (scaled) $\chi^2$ distribution even when $\overline{X}$ is already approximately normal.

    Below is the result of a numerical simulation. This simulation consists of $10^5$ trials, where in each trial, I sampled $n$ values from the exponential distribution with unit rate and then computed three statistics:

    • standardized sample mean, $\sqrt{n}(\overline{X} - \mu)/\sigma$
    • standardized sample variance, $\sqrt{n}(S^2 - \sigma^2)/\sqrt{\mu_4 - \sigma^4}$
    • t-statistic $t$

    By varying the value of $n$, we see that $\overline{X}$ becomes approximately normal much faster than $S^2$ and $t$:

[Simulation figure: distributions of the three statistics for varying $n$]
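A smaller-scale sketch of this simulation (fewer trials than the original $10^5$, and skewness as the measure of non-normality, since the original figure is not reproduced here). For Exp(1), $\mu = 1$, $\sigma^2 = 1$, and the fourth central moment is $\mu_4 = 9$:

```python
import numpy as np

# Draw n values from Exp(1) per trial, compute the three statistics from the
# answer, and measure how non-normal each still is via empirical skewness
# (a normal distribution has skewness 0).
rng = np.random.default_rng(42)
mu, sigma2, mu4 = 1.0, 1.0, 9.0   # moments of the Exp(1) distribution
trials = 20_000

def skewness(v):
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

results = {}
for n in [20, 100, 500]:
    x = rng.exponential(1.0, size=(trials, n))
    xbar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)
    z_mean = np.sqrt(n) * (xbar - mu) / np.sqrt(sigma2)
    z_var = np.sqrt(n) * (s2 - sigma2) / np.sqrt(mu4 - sigma2 ** 2)
    tstat = np.sqrt(n) * (xbar - mu) / np.sqrt(s2)
    results[n] = (skewness(z_mean), skewness(z_var), skewness(tstat))
    print(f"n={n:4d}  skew(mean)={results[n][0]:+.2f}  "
          f"skew(var)={results[n][1]:+.2f}  skew(t)={results[n][2]:+.2f}")
```

At every $n$, the standardized sample variance stays far more skewed than the standardized sample mean - the numerical counterpart of "$S^2$ converges much more slowly than $\overline{X}$".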