I've been wondering about this question for a while! There's a good chance I'm completely misunderstanding something, but I want to be clear on this point... Let me explain.
Basically, the central limit theorem (CLT) states that if we sum $n$ independent, identically distributed random variables $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, then the distribution of the standardized sum $\frac{S - n\mu}{\sigma\sqrt{n}}$, where $S = \sum_{i=1}^n X_i$, tends toward a standard normal distribution as $n$ increases.
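A minimal simulation sketch of this statement (my own illustration, with arbitrary choices of distribution and sample sizes): sum $n$ iid Uniform(0, 1) variables many times and standardize each sum; by the CLT the standardized sums should look approximately standard normal.

```python
import math
import random
import statistics

random.seed(0)
n = 500        # number of variables in each sum
trials = 2000  # number of independent sums

mu = 0.5                   # mean of Uniform(0, 1)
sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

# Each element of `sums` is one realization of S = X_1 + ... + X_n.
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

# Standardize: (S - n*mu) / (sigma * sqrt(n)) should be ~ N(0, 1).
z = [(s - n * mu) / (sigma * math.sqrt(n)) for s in sums]

# Sanity check: mean close to 0, standard deviation close to 1.
print(statistics.mean(z), statistics.stdev(z))
```
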
In practice, we always assume that if we have a large number of observations $x_1, \dots, x_n$, the CLT applies.
However, in practice, all these observations come from one random variable $X$ (or at least we assume they do). These observations are not $n$ random variables, so why does the CLT apply?
For example, suppose we have the following dataset and we want to know whether there's a significant salary difference between the two groups "S" (soccer) and "B" (basketball) in the United States:
| individual | salary (M$ per year) | sport (S = soccer, B = basketball) |
|---|---|---|
| 1 | 1 | S |
| ... | ... | ... |
| 999 | 8 | B |
| 1000 | 0.9 | S |
Skipping all the details: in this situation I would have run a t-test regardless of the salary distributions within the two groups, since the CLT applies thanks to the large number of observations, $n = 1000 \gg 30$, with $n_S = n_B = 500$.
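Here is a sketch of that situation with simulated (not real) salaries: two groups of 500 drawn from skewed log-normal distributions, so the raw salaries themselves are far from normal. The Welch two-sample t statistic below is built from the two sample means, which is where the CLT is supposed to come in.

```python
import math
import random
import statistics

random.seed(1)
# Hypothetical skewed salary data; the parameters are arbitrary.
soccer = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
basketball = [random.lognormvariate(0.2, 1.0) for _ in range(500)]

def welch_t(a, b):
    """Welch two-sample t statistic: difference of sample means over its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

t = welch_t(soccer, basketball)
print(t)
```
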
The point I want to highlight is that we used the large number of *observations* to justify applying the CLT, not a large number of *random variables*... The t-test statistic uses the estimated mean of the sample, but that estimated mean sums the values of the observations, not random variables. I actually can't see where the random variables $X_1, \dots, X_n$ from the CLT are.
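To make my confusion concrete (with simulated stand-in data, not real salaries): once the observations are collected, the sample mean I compute is just a single fixed number, and I don't see $n$ random variables anywhere in the computation.

```python
import random
import statistics

random.seed(2)
# Stand-in for 1000 collected observations (arbitrary distribution).
observations = [random.gauss(5.0, 2.0) for _ in range(1000)]

# The estimated mean sums up observed values -- it is one fixed number,
# not visibly a sum of random variables.
xbar = statistics.mean(observations)
print(xbar)
```
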
I'm very confused; I'd be glad if someone could clear this up for me. Thank you.