As a mathematician currently pursuing a master's degree in machine learning, I aim to develop an intuitive understanding of statistics, particularly concerning the Central Limit Theorem (CLT). While I comprehend that the CLT implies the convergence of cumulative distribution functions, a specific aspect related to Bernoulli variables puzzles me.
In various educational videos on the Central Limit Theorem, there seems to be a strong implication that the frequency histogram of the distribution of the variable $T_{n} = \frac{\sum_{i=1}^{n} X_{i} - n\mu}{\sqrt{n}\,\sigma}$ converges uniformly on compact sets to the graph of the Gaussian density. I have reconstructed a possible definition for this sequence of histograms as piecewise constant functions:
$$ \sum_{i=0}^{N-1} P(\{ T_{N} = s^{N}_{i} \}) \cdot \mathbb{1}_{[x^{N}_{i},x^{N}_{i+1})} $$
where each interval $[x^{N}_{i},x^{N}_{i+1})$ contains exactly one of the values $s_{i}^{N}$ assumed by $T_{N}$, and $N \in \mathbb{N}$.
However, despite watching numerous explanatory videos, there seems to be a gap in my understanding. I suspect that a crucial detail may be eluding me. I would greatly appreciate it if someone could shed light on this or provide a reference for further exploration.
Direct feedback: In your question you give the following expression: $$ \sum_{i=0}^{N-1} P(\{ T_{N} = s^{N}_{i} \}) \cdot \mathbb{1}_{[x^{N}_{i},x^{N}_{i+1})} $$ This expression does not make sense here:
What is the definition of $s_i^N$? Is it random? Why are there only $N$ possible values of $s_i^N$?
What is the definition of your indicator function $1_{[x_i^N, x_{i+1}^N)}$? Is this a random variable? A function of some variable $x$?
What type of object is that expression? Is it a number? A random variable?
CLT Statement: If $\{X_i\}_{i=1}^{\infty}$ are independent and identically distributed (i.i.d.) with finite mean $\mu$ and finite and nonzero variance $\sigma^2$ then $$ \lim_{n\rightarrow\infty} P[T_n\leq x] = P[G\leq x] \quad \forall x \in \mathbb{R}$$ where $G \sim N(0,1)$ and $T_n = \frac{1}{\sqrt{n\sigma^2}}\sum_{i=1}^n(X_i-\mu)$ for $n \in \{1, 2, 3, ...\}$.
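Since the question mentions Bernoulli variables, it may help to check this statement numerically in that case. Below is a minimal Monte Carlo sketch (Python; the function names are my own) that estimates $P[T_n \leq x]$ for Bernoulli$(p)$ summands and compares it with $\Phi(x) = P[G \leq x]$:

```python
import math
import random

random.seed(0)

def phi(x):
    """Standard normal CDF, Phi(x) = P[G <= x] for G ~ N(0,1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def empirical_cdf_Tn(n, x, p=0.5, trials=50_000):
    """Monte Carlo estimate of P[T_n <= x] when X_i ~ Bernoulli(p)."""
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if random.random() < p)  # Binomial(n, p)
        t = (s - n * mu) / (math.sqrt(n) * sigma)
        if t <= x:
            hits += 1
    return hits / trials

for n in (5, 20, 100):
    print(n, round(empirical_cdf_Tn(n, 0.5), 3), round(phi(0.5), 3))
```

Note that for Bernoulli summands each $T_n$ is a discrete random variable with no density at all; it is the CDF values $P[T_n \leq x]$ that approach $\Phi(x)$, which is exactly the sense of convergence in the statement above.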
On histogram visualizations of the CLT: Sometimes $n=20$ is large enough for $T_{20}$ to have a distribution that is close to (but not the same as) a Gaussian $N(0,1)$ distribution.
How can we visualize this?
Well, we can first choose some distribution for the $X_i$ variables, say, $X_i \sim Unif(0,1)$ (so $\mu=1/2$, $\sigma^2=1/12$). We can make two intervals $I_1=[0.1, 0.2)$ and $I_2=[0.2, 0.3)$ and then, somehow, attempt to show \begin{align*} P[T_{20}\in I_1] &\approx P[G \in I_1] = \int_{0.1}^{0.2}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\\ P[T_{20} \in I_2] &\approx P[G \in I_2]= \int_{0.2}^{0.3}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\\ \end{align*} We can of course consider any finite collection of intervals that we want, but for simplicity let's just consider intervals $I_1$ and $I_2$.
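The two Gaussian integrals on the right-hand side have no elementary antiderivative, but they can be evaluated via the error function, using $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. A short sketch (function names are mine):

```python
import math

def gauss_prob(a, b):
    """P[G in [a, b)] for G ~ N(0,1), using Phi(x) = (1 + erf(x/sqrt(2)))/2."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi(b) - Phi(a)

p1 = gauss_prob(0.1, 0.2)  # P[G in I_1], roughly 0.0394
p2 = gauss_prob(0.2, 0.3)  # P[G in I_2], roughly 0.0387
print(p1, p2)
```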
The challenge is that the distribution of $T_{20}$ is very complicated (it is not Gaussian). To compute the PDF of $T_{20}$, we would need to convolve scaled versions of the uniform PDF with itself 20 times. Therefore, it is not easy to compute exact values of $P[T_{20} \in I_1]$ and $P[T_{20} \in I_2]$. However, it is easy to generate i.i.d. random variables $\{Y_i\}_{i=1}^{\infty}$, all having the same distribution as $T_{20}$. Then we can take a histogram of how many times the $Y_i$ samples fall into the given intervals and use the law of large numbers (LLN) to claim that this converges (with probability 1) to the exact probability for $P[T_{20} \in I_1]$ and $P[T_{20}\in I_2]$.
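The convolution route is tedious by hand but easy to carry out numerically, which gives a useful cross-check on the simulation. A sketch (assuming NumPy; the grid step and names are my own choices) that approximates the density of $\sum_{i=1}^{20} X_i$ by repeated discrete convolution and then reads off $P[T_{20} \in I_1]$:

```python
import numpy as np

dx = 0.001
u = np.ones(int(1 / dx))        # Unif(0,1) density sampled on a grid of step dx

# The density of a sum of independent variables is the convolution of their
# densities: convolve 20 copies of the uniform density (Irwin-Hall(20)).
dens = u.copy()
for _ in range(19):
    dens = np.convolve(dens, u) * dx

s = np.arange(len(dens)) * dx   # grid for S = X_1 + ... + X_20, on [0, 20]

# T_20 = (S - 20*mu) / sqrt(20*sigma^2); express the interval I_1 in S-units.
mu, sigma2 = 0.5, 1.0 / 12.0
scale = np.sqrt(20 * sigma2)
a, b = 20 * mu + 0.1 * scale, 20 * mu + 0.2 * scale

mask = (s >= a) & (s < b)
p_I1 = dens[mask].sum() * dx    # Riemann sum for P[T_20 in [0.1, 0.2)]
print(p_I1)
```

The result should land close to the Gaussian value of $P[G \in I_1]$, which is consistent with $n=20$ already being "large" for uniform summands.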
To do this, let's arrange i.i.d. samples of the $Unif(0,1)$ distribution into a doubly-indexed collection of random variables: $X_{i,j}$ for $i \in \{1, 2, 3, ...\}$ and $j \in \{1, ..., 20\}$. Define $$ Y_i =\frac{1}{\sqrt{20\sigma^2}}\sum_{j=1}^{20}(X_{i,j}-\mu) \quad \forall i \in \{1, 2, 3, ...\}$$ Since for different indices $i$ we are using different samples $X_{i,j}$, we know $\{Y_i\}_{i=1}^{\infty}$ are i.i.d. with the same distribution as $T_{20}$. Thus, the random variables $\{W_i\}_{i=1}^{\infty}$ defined below are also i.i.d., as are $\{Z_i\}_{i=1}^{\infty}$: $$ W_i = 1_{\{Y_i \in I_1\}} \quad \forall i \in \{1, 2, 3, ...\}$$ $$ Z_i = 1_{\{Y_i \in I_2\}} \quad \forall i \in \{1, 2, 3, ...\}$$ By the LLN we know (with prob 1): $$\lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^nW_i = E[W_1] = P[Y_1 \in I_1] = P[T_{20}\in I_1]$$ $$\lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^nZ_i = E[Z_1] = P[Y_1 \in I_2] = P[T_{20}\in I_2]$$
Observe that $\frac{1}{n}\sum_{i=1}^n W_i$ and $\frac{1}{n}\sum_{i=1}^n Z_i$ are just histogram values: the fraction of samples that fall in the intervals $I_1$ and $I_2$. If $n$ is large then (with high probability) these are close to the exact values of $P[T_{20}\in I_1]$ and $P[T_{20} \in I_2]$.
Now, taking $n\rightarrow\infty$ does not give any convergence to a Gaussian. No, it only (by the law of large numbers) gives convergence with probability 1 to the exact values $P[T_{20}\in I_1]$ and $P[T_{20} \in I_2]$. Since 20 is "large" we may expect these exact values to be close to $P[G \in I_1]$ and $P[G \in I_2]$. If we like, we can directly compare the results by using experimental data.
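For instance, here is a minimal sketch of that experiment (Python; the names are my own), generating the $Y_i$ samples from batches of 20 uniforms and comparing the histogram frequencies with the Gaussian probabilities:

```python
import math
import random

random.seed(1)

MU, SIGMA2 = 0.5, 1.0 / 12.0            # mean and variance of Unif(0,1)

def sample_Y(batch=20):
    """One draw of Y_i, distributed as T_20: a normalized batch of uniforms."""
    s = sum(random.random() for _ in range(batch))
    return (s - batch * MU) / math.sqrt(batch * SIGMA2)

def interval_freq(a, b, n=100_000, batch=20):
    """(1/n) * sum of indicators 1{Y_i in [a, b)} -- the LLN estimate."""
    return sum(1 for _ in range(n) if a <= sample_Y(batch) < b) / n

Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
for a, b in [(0.1, 0.2), (0.2, 0.3)]:
    print(f"I=[{a},{b}):  LLN estimate {interval_freq(a, b):.4f}"
          f"  vs  Gaussian {Phi(b) - Phi(a):.4f}")
```

The two sources of error are separate: the Monte Carlo error (controlled by $n$, via the LLN) and the gap between the exact distribution of $T_{20}$ and the Gaussian (controlled by the batch size, via the CLT).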
If you want to visualize a closer convergence to a Gaussian, you can take batches of size 30 (rather than batches of size 20). Then we can approximate $P[T_{30} \in I_1]$ and $P[T_{30} \in I_2]$, which will be closer approximations to $P[G \in I_1]$ and $P[G \in I_2]$.