A doubt concerning the central limit theorem and how it is presented in explanatory videos


As a mathematician currently pursuing a master's degree in machine learning, I aim to develop an intuitive understanding of statistics, particularly concerning the Central Limit Theorem (CLT). While I comprehend that the CLT implies the convergence of cumulative distribution functions, a specific aspect related to Bernoulli variables puzzles me.

In various educational videos on the Central Limit Theorem, there seems to be a strong implication that the frequency histogram of the probability density function for the variable $T_{n} = \frac{\sum_{i=1}^{n} X_{i} - n\mu}{\sqrt{n}\,\sigma}$ converges uniformly on compact sets to the graph of the Gaussian density. I have reconstructed a possible definition for this sequence of histograms as piecewise constant functions:

$$ \sum_{i=0}^{N-1} P(\{ T_{N} = s^{N}_{i} \}) \cdot \mathbb{1}_{[x^{N}_{i},x^{N}_{i+1})} $$

where each interval $[x^{N}_{i},x^{N}_{i+1})$ contains exactly one of the values $s_{i}^{N}$ assumed by $T_{N}$, and $N \in \mathbb{N}$.
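For concreteness, here is a small numerical sketch of this histogram for $X_i \sim \mathrm{Bernoulli}(1/2)$ (my own construction, not from any video): in this case $T_N$ takes the $N+1$ values $s_k = (2k-N)/\sqrt{N}$, spaced $2/\sqrt{N}$ apart, and the bar heights can be computed exactly from the binomial distribution.

```python
import math

def bernoulli_histogram(N):
    # For X_i ~ Bernoulli(1/2): sum S_N ~ Binomial(N, 1/2), so
    # P(T_N = s_k) = C(N, k) p^k (1-p)^(N-k) with s_k = (k - N*mu)/(sqrt(N)*sigma)
    p, mu, sigma = 0.5, 0.5, 0.5
    probs = [math.comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)]
    values = [(k - N * mu) / (math.sqrt(N) * sigma) for k in range(N + 1)]
    return values, probs

def gaussian_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

N = 400
values, probs = bernoulli_histogram(N)
spacing = values[1] - values[0]   # bin width 2/sqrt(N)
k0 = N // 2                       # index of the value s_k closest to 0
print(probs[k0])                  # a single bar height shrinks as N grows
print(probs[k0] / spacing, gaussian_pdf(values[k0]))  # height/width tracks phi
```

Note that the individual bar heights $P(T_N = s_k)$ shrink toward zero as $N$ grows; it is the heights divided by the bin width that stay close to the Gaussian density.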

However, despite watching numerous explanatory videos, there seems to be a gap in my understanding. I suspect that a crucial detail may be eluding me. I would greatly appreciate it if someone could shed light on this or provide a reference for further exploration.

2 Answers

Answer 1 (accepted)

Direct feedback: In your question you give the following expression: $$ \sum_{i=0}^{N-1} P(\{ T_{N} = s^{N}_{i} \}) \cdot \mathbb{1}_{[x^{N}_{i},x^{N}_{i+1})} $$ This expression does not make sense here:

  • What is the definition of $s_i^N$? Is it random? Why are there only $N$ possible values of $s_i^N$?

  • What is the definition of your indicator function $1_{[x_i^N, x_{i+1}^N)}$? Is this a random variable? A function of some variable $x$?

  • What type of object is that expression? Is it a number? A random variable?


CLT Statement: If $\{X_i\}_{i=1}^{\infty}$ are independent and identically distributed (i.i.d.) with finite mean $\mu$ and finite and nonzero variance $\sigma^2$ then $$ \lim_{n\rightarrow\infty} P[T_n\leq x] = P[G\leq x] \quad \forall x \in \mathbb{R}$$ where $G \sim N(0,1)$ and $T_n = \frac{1}{\sqrt{n\sigma^2}}\sum_{i=1}^n(X_i-\mu)$ for $n \in \{1, 2, 3, ...\}$.


On histogram visualizations of the CLT: Sometimes $n=20$ is large enough for $T_{20}$ to have a distribution that is close to (but not the same as) a Gaussian $N(0,1)$ distribution.

How can we visualize this?

Well, we can first choose some distribution for the $X_i$ variables, say, $X_i \sim Unif(0,1)$ (so $\mu=1/2$, $\sigma^2=1/12$). We can make two intervals $I_1=[0.1, 0.2)$ and $I_2=[0.2, 0.3)$ and then, somehow, attempt to show \begin{align*} P[T_{20}\in I_1] &\approx P[G \in I_1] = \int_{0.1}^{0.2}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\\ P[T_{20} \in I_2] &\approx P[G \in I_2]= \int_{0.2}^{0.3}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\\ \end{align*} We can of course consider any finite collection of intervals that we want, but for simplicity let's just consider intervals $I_1$ and $I_2$.
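For reference, the two Gaussian targets above can be computed directly from the error function (a small sketch; `Phi` here is the standard normal CDF, expressed via `math.erf`):

```python
import math

def Phi(x):
    # standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p1 = Phi(0.2) - Phi(0.1)   # P[G in I_1)
p2 = Phi(0.3) - Phi(0.2)   # P[G in I_2)
print(round(p1, 4), round(p2, 4))  # approximately 0.0394 and 0.0387
```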

The challenge is that the distribution of $T_{20}$ is very complicated (it is not Gaussian). To compute the PDF of $T_{20}$, we would need to convolve scaled versions of the uniform PDF with itself 20 times. Therefore, it is not easy to compute exact values of $P[T_{20} \in I_1]$ and $P[T_{20} \in I_2]$. However, it is easy to generate i.i.d. random variables $\{Y_i\}_{i=1}^{\infty}$, all having the same distribution as $T_{20}$. Then we can take a histogram of how many times the $Y_i$ samples fall into the given intervals and use the law of large numbers (LLN) to claim that this converges (with probability 1) to the exact probability for $P[T_{20} \in I_1]$ and $P[T_{20}\in I_2]$.

To do this, let's arrange i.i.d. samples of the $U(0,1)$ distribution into a doubly-indexed collection of random variables: $X_{i,j}$ for $i \in \{1, 2, 3, ...\}$ and $j \in \{1, ..., 20\}$. Define $$ Y_i =\frac{1}{\sqrt{20\sigma^2}}\sum_{j=1}^{20}(X_{i,j}-\mu) \quad \forall i \in \{1, 2, 3, ...\}$$ Since for different indices $i$ we are using different samples $X_{i,j}$, we know $\{Y_i\}_{i=1}^{\infty}$ are i.i.d. with distribution the same as $T_{20}$. Thus, the random variables $\{W_i\}_{i=1}^{\infty}$ defined below are also i.i.d., as are $\{Z_i\}_{i=1}^{\infty}$: $$ W_i = 1_{\{Y_i \in I_1\}} \quad \forall i \in \{1, 2, 3, ...\}$$ $$ Z_i = 1_{\{Y_i \in I_2\}} \quad \forall i \in \{1, 2, 3, ...\}$$ By the LLN we know (with prob 1): $$\lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^nW_i = E[W_1] = P[Y_1 \in I_1] = P[T_{20}\in I_1]$$ $$\lim_{n\rightarrow\infty} \frac{1}{n}\sum_{i=1}^nZ_i = E[Z_1] = P[Y_1 \in I_2] = P[T_{20}\in I_2]$$

Observe that $\frac{1}{n}\sum_{i=1}^n W_i$ and $\frac{1}{n}\sum_{i=1}^n Z_i$ are just histogram values for the fraction of time being in the intervals $I_1$ and $I_2$. If $n$ is large then (with high probability) these are close to the exact values of $P[T_{20}\in I_1]$ and $P[T_{20} \in I_2]$.

Now, taking $n\rightarrow\infty$ does not give any convergence to a Gaussian. No, it only (by the law of large numbers) gives convergence with probability 1 to the exact values $P[T_{20}\in I_1]$ and $P[T_{20} \in I_2]$. Since 20 is "large" we may expect these exact values to be close to $P[G \in I_1]$ and $P[G \in I_2]$. If we like, we can directly compare the results by using experimental data.
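For example, here is a minimal Monte Carlo sketch of that comparison (my own code, with $n = 200000$ batches of 20 uniforms; the sample fractions play the roles of $\frac{1}{n}\sum W_i$ and $\frac{1}{n}\sum Z_i$):

```python
import math
import random

random.seed(0)  # for reproducibility

mu, sigma2 = 0.5, 1.0 / 12.0          # mean and variance of Unif(0,1)
scale = math.sqrt(20 * sigma2)

def sample_T20():
    # one draw of Y_i = (1/sqrt(20*sigma^2)) * sum_{j=1}^{20} (X_{i,j} - mu)
    return sum(random.random() - mu for _ in range(20)) / scale

n = 200_000
hits1 = hits2 = 0
for _ in range(n):
    y = sample_T20()
    if 0.1 <= y < 0.2:
        hits1 += 1
    elif 0.2 <= y < 0.3:
        hits2 += 1

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(hits1 / n, Phi(0.2) - Phi(0.1))  # both should be near 0.039
print(hits2 / n, Phi(0.3) - Phi(0.2))  # both should be near 0.039
```

The sample fractions estimate the exact values $P[T_{20}\in I_1]$ and $P[T_{20}\in I_2]$; these in turn are close to the Gaussian probabilities only because 20 is already "large."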

If you want to visualize a closer convergence to a Gaussian, you can take batches of size 30 (rather than batches of size 20). Then we can approximate $P[T_{30} \in I_1]$ and $P[T_{30} \in I_2]$, which will be closer approximations to $P[G \in I_1]$ and $P[G \in I_2]$.

Answer 2

Here is a statement and proof of the related result mentioned in the comments.

Claim: Suppose that

  • We have (possibly discontinuous) functions $\{F_n\}_{n=1}^{\infty}$ of the form $F_n:\mathbb{R}\rightarrow\mathbb{R}$.

  • For each $n\in\{1,2,3,...\}$, $F_n(x)$ is nondecreasing in $x$ (so $x<y\implies F_n(x)\leq F_n(y)$).

  • We have a continuous function $F:\mathbb{R}\rightarrow\mathbb{R}$.

  • $\lim_{n\rightarrow\infty} F_n(x) = F(x)$ for all $x \in \mathbb{R}$.

Then for each compact interval $[a,b] \subseteq \mathbb{R}$ and each $\epsilon>0$, there is a positive integer $m$ such that $$ \sup_{x \in [a,b]}|F_n(x)-F(x)|\leq \epsilon \quad \forall n \geq m$$ This is called "uniform convergence over compact sets."

Note: We can apply this to your CLT question by defining $F_n$ as the CDF function of $T_n$ and $F$ as the CDF of a $N(0,1)$ Gaussian. Then $F_n(x)$ is nondecreasing in $x$ for each fixed $n$. Also, $F$ is continuous. So if we believe the standard CLT result that $\lim_{n\rightarrow\infty} F_n(x)=F(x)$ for all $x \in \mathbb{R}$, the above claim also gives us uniform convergence over compact sets.
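To illustrate this convergence numerically, here is a sketch of my own, assuming $X_i \sim \mathrm{Bernoulli}(1/2)$ so that $F_n$ can be computed exactly from the binomial distribution:

```python
import math

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_cdf_table(n):
    # P[S_n <= k] for S_n ~ Binomial(n, 1/2), k = 0..n
    total, table = 0, []
    for k in range(n + 1):
        total += math.comb(n, k)
        table.append(total / 2**n)
    return table

def F_n(n, table, x):
    # CDF of T_n = (S_n - n/2) / (sqrt(n)/2) evaluated at x
    kmax = math.floor(n / 2 + x * math.sqrt(n) / 2)
    if kmax < 0:
        return 0.0
    return table[min(kmax, n)]

def sup_gap(n, a=-3.0, b=3.0, grid=1201):
    # approximate sup over [a, b] of |F_n(x) - Phi(x)| on a fine grid
    table = binom_cdf_table(n)
    xs = [a + (b - a) * i / (grid - 1) for i in range(grid)]
    return max(abs(F_n(n, table, x) - Phi(x)) for x in xs)

print(sup_gap(25), sup_gap(400))  # the gap over [-3, 3] shrinks as n grows
```

The observed decay is consistent with the Berry–Esseen rate $O(1/\sqrt{n})$, though the claim above only needs that the gap eventually falls below any $\epsilon$.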

Proof of claim: Fix interval $[a,b]$ (assume $a<b$). Fix $\epsilon>0$. Since $F$ is continuous it is uniformly continuous over the compact interval $[a,b]$. So there is a $\delta>0$ such that if $x,y \in [a,b]$ then $$(|x-y|\leq \delta) \implies |F(x)-F(y)|\leq \epsilon/2$$ Chop $[a,b]$ into $k$ equally spaced subintervals, each of size no more than $\delta$. Let $\{x_i\}_{i=0}^k$ be the grid points (so $x_0=a$, $x_k=b$). Choose $m$ such that $n\geq m$ implies $|F_n(x_i)-F(x_i)|\leq \epsilon/2$ for all $i \in \{0, ..., k\}$.

Now fix $y \in [a,b]$ and $n\geq m$. Note that $x_i\leq y\leq x_{i+1}$ for some $i \in\{0,1,...,k-1\}$. Since $F_n(x)$ is nondecreasing in $x$, by considering the two cases either $F_n(y)\leq F(y)$ or $F_n(y)>F(y)$, we get: $$|F_n(y)-F(y)|\leq \max\{|F_n(x_i)-F(y)|, |F_n(x_{i+1})-F(y)|\} \quad (*)$$ On the other hand by the triangle inequality we get $$ |F_n(x_i)-F(y)|\leq |F_n(x_i)-F(x_i)|+|F(x_i)-F(y)|\leq \epsilon/2 + \epsilon/2 = \epsilon$$ since $n\geq m$ and $|x_i-y|\leq \delta$. Similarly $$ |F_n(x_{i+1})-F(y)|\leq \epsilon$$ It follows from (*) that $$|F_n(y)-F(y)|\leq \epsilon$$ This holds for all $y \in [a,b]$. $\Box$