Convergence in distribution is preserved under continuous transformations


$X_n$ converges in distribution to $X$ implies $X_n^{2}$ also converges in distribution to $X^{2}$.

As I don't know measure theory and hence don't understand the continuous mapping theorem, can I have an elementary proof using basic definitions of convergence and elementary real analysis?

Thanks in advance.


What, exactly, is your definition of "convergence in distribution"?

The following is based on the one I use. (It might take an additional chunk of reasoning to show some other definition is equivalent to this one; this additional reasoning might well be part of the continuous mapping theorem you are uncomfortable with.)

To say $X_n$ converges in distribution to $X$ means that for each bounded continuous function $b(\cdot)$, the sequence of numbers $Eb(X_n)$ converges to the number $Eb(X)$.

To show $X_n^2$ converges in distribution to $X^2$ we only need to show that for each bounded continuous function $c(\cdot)$, we have $Ec(X_n^2)\to Ec(X^2)$. OK, given such a $c$, consider the particular $b$ given by $b(x)=c(x^2)$. Is $b$ continuous? Yes, because it is the composition of two continuous functions. Is it bounded? Yes, because $c$ is bounded. So $b$ is continuous and bounded, and since the $X_n$ converge to $X$ in distribution, $Eb(X_n)\to Eb(X)$. But $b(X_n)=c(X_n^2)$ and $Eb(X_n)=Ec(X_n^2)$, and similarly $Eb(X)=Ec(X^2)$. So the condition we needed to check holds: for every continuous and bounded $c$, $\lim_n Ec(X_n^2) = Ec(X^2)$, as desired.

Of course, there is nothing special about $x\mapsto x^2$ here: any continuous function $x\mapsto \phi(x)$ would work as well: $b(x)=c(\phi(x))$ is continuous if $c$ and $\phi$ are, and is bounded if $c$ is.
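If it helps to see the criterion in action, here is a minimal numerical sketch (the choice of $X$, $X_n$, and the test function $c$ is mine, purely for illustration, and the Monte Carlo estimate is of course not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (my own choice): X ~ N(0, 1) and X_n = X + 1/n,
# so X_n -> X in distribution.  c is bounded and continuous, b(x) = c(x^2).
c = np.arctan                      # bounded by pi/2, continuous


def b(x):
    return c(x ** 2)


X = rng.standard_normal(200_000)   # Monte Carlo sample of X
print("E c(X^2) ~", b(X).mean())
for n in (1, 10, 100, 1000):
    Xn = X + 1.0 / n
    # E[c(X_n^2)] = E[b(X_n)] should approach E[c(X^2)] = E[b(X)]
    print(n, b(Xn).mean())
```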


Given a sequence of real-valued random variables $(X_n)_{n \geq 1}$ with distribution functions $(F_n)_{n \geq 1}$, classically the definition of convergence of $X_n$ in distribution to a random variable $X$ with distribution function $F$ is that on the set of continuity points of $F$, $F_n$ converges pointwise to $F$. It's not too hard to directly use this definition here.

Here's an interpretation of continuity points that will come in handy: if $F$ is the distribution function of a random variable $X$, then recall that the probability mass function $p_X(a) = \Bbb P(X = a)$ can be recovered from the distribution function by $p_X(a) = F(a) - \lim_{x \uparrow a} F(x)$. If $a$ is a continuity point of $F$, this is zero, and conversely as well, since, being a distribution function, $F$ is continuous at $a$ iff it's left-continuous at $a$. So $a$ is a continuity point precisely when there is no "mass" at $a$ (in measure-theoretic terminology, $F$ is continuous everywhere precisely when the probability measure on $\mathbf{R}$ induced by $X$ has no atoms).

In our situation, write $G_n$ for the distribution function of $X_n^2$ (for each $n \geq 1$) and $G$ for the distribution function of $X^2$. It's easy to check that, for $x \geq 0$, $G_n(x) = F_n(x^{1/2}) - F_n(-x^{1/2}) + p_{X_n}(-x^{1/2})$ (and similarly, without the index $n$, for $G$). If $x$ is a continuity point of $G$, then by the previous observations, $\Bbb P(X^2 = x) = 0$. As $(X = \pm x^{1/2})$ are subevents of $(X^2 = x)$, this implies $p_X(x^{1/2}) = p_X(-x^{1/2}) = 0$, therefore $\pm x^{1/2}$ are continuity points of $F$.
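(In case the "easy to check" step is not obvious, the computation for $x \geq 0$ is just
$$G_n(x) = \Bbb P(X_n^2 \leq x) = \Bbb P(-x^{1/2} \leq X_n \leq x^{1/2}) = F_n(x^{1/2}) - \Bbb P(X_n < -x^{1/2}) = F_n(x^{1/2}) - F_n(-x^{1/2}) + p_{X_n}(-x^{1/2}),$$
while for $x < 0$ both sides are $0$.)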

As $X_n$ converges in distribution to $X$, whenever $x$ is a continuity point of $G$, $\pm x^{1/2}$ are continuity points of $F$, so that $F_n(x^{1/2}) - F_n(-x^{1/2}) \to F(x^{1/2}) - F(-x^{1/2})$. To deal with the point-mass term, note that as $p_X(-x^{1/2}) = 0$, for any $\varepsilon > 0$ there is a $y \in (-\infty, -x^{1/2})$ such that $F(-x^{1/2}) - F(y) < \varepsilon/2$. Moreover we can choose $y$ to be a continuity point of $F$, as the set of continuity points of a distribution function is dense. Therefore, for some $N > 0$, $|[F_n(-x^{1/2}) - F_n(y)] - [F(-x^{1/2}) - F(y)]| < \varepsilon/2$ whenever $n \geq N$. Hence, $F_n(-x^{1/2}) - F_n(y) = \Bbb P(y < X_n \leq -x^{1/2}) < \varepsilon$ for all $n \geq N$. $(X_n = -x^{1/2})$ is a subevent of $(y < X_n \leq -x^{1/2})$, so $\Bbb P(X_n = -x^{1/2}) < \varepsilon$ for all $n \geq N$. Therefore, $\lim_n p_{X_n}(-x^{1/2}) = 0 = p_X(-x^{1/2})$.

Putting all the pieces together, we get $\lim_n G_n(x) = G(x)$ whenever $x$ is a continuity point of $G$. This proves $X_n^2$ converges in distribution to $X^2$ whenever $X_n$ converges in distribution to $X$.
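If you want to see the conclusion numerically, here is a small sketch using exact distribution functions (the particular choice $X \sim N(0,1)$, $X_n = (1 + 1/n)X$ is mine, just for illustration):

```python
import numpy as np
from scipy import stats

# Illustration (my own setup): X ~ N(0, 1) and X_n = (1 + 1/n) X, so X_n -> X
# in distribution.  Then X^2 ~ chi^2_1 and X_n^2 is chi^2_1 scaled by (1 + 1/n)^2,
# so G_n and G are available in closed form.
x = 2.0                                  # a continuity point of G (here G is continuous everywhere)
G_x = stats.chi2.cdf(x, df=1)            # G(x) = P(X^2 <= x)
print("G(x) =", G_x)
for n in (1, 10, 100, 1000):
    Gn_x = stats.chi2.cdf(x, df=1, scale=(1 + 1 / n) ** 2)   # G_n(x) = P(X_n^2 <= x)
    print(n, Gn_x)
```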

Some more abstract measure theoretic comments in case you're interested: A mildly surprising fact about convergence in distribution is that if $X_n$ is a sequence of real-valued random variables converging in distribution to $X$ then one can find another sequence $(Y_n)_{n \geq 1}$ of random variables, and a random variable $Y$ defined on a common probability measure space $(\Omega, \mathcal{A}, \Bbb P)$ such that $Y_n$ and $Y$ are equal in distribution to $X_n$ and $X$ respectively, and $Y_n$ converges to $Y$ in the $\Bbb P$-almost everywhere (or "almost sure", if you're a probabilist) sense.

This is known as the Skorokhod representation theorem. The continuous mapping theorem you refer to is a not-so-hard corollary of this: Given any continuous function $\phi \in C^0(\mathbf{R})$, $\{\omega \in \Omega : \lim Y_n(\omega) = Y(\omega)\}$ is a subset of $\{\omega \in \Omega : \lim \phi(Y_n(\omega)) = \phi(Y(\omega))\}$ (continuous functions preserve limits), and since the former event has probability one, as $Y_n \to Y$ almost everywhere, the latter has probability one as well. This implies $\phi(Y_n) \to \phi(Y)$ almost everywhere, hence in particular in distribution as well. Since replacing random variables by equal-in-distribution random variables doesn't change convergence in distribution (that's why it's so weak: it's only about convergence at the level of the distribution function), this proves $\phi(X_n) \to \phi(X)$ in distribution.

The equivalent definition given in the existing answer is also a corollary of Skorokhod representation; given $X_n$ converging in distribution to $X$, find $Y_n$ and $Y$ as before. If $f \in C^0_b(\mathbf{R})$ is bounded continuous, say with uniform bound $|f| < M$, then $f(Y_n) \to f(Y)$ almost everywhere and $|f(Y_n)| < M$ imply, by the dominated (bounded) convergence theorem, $\Bbb E f(Y_n) \to \Bbb E f(Y)$, and since $f(Y_n) \stackrel{d}{=} f(X_n)$ and $f(Y) \stackrel{d}{=} f(X)$, $\Bbb E f(X_n) \to \Bbb E f(X)$. The converse doesn't require any big machinery and you can try to prove it yourself: if $\Bbb E f(X_n) \to \Bbb E f(X)$ for every $f \in C^0_b(\mathbf{R})$, explicitly approximate the indicator function $\chi_{(-\infty, x]}$ by a sequence of functions in $C^0_b(\mathbf{R})$, and if you can prove $\Bbb E \chi_{(-\infty, x]}(X_n) \to \Bbb E \chi_{(-\infty, x]}(X)$ (for all continuity points $x$ of $F_X$), you're done (the expectation of the indicator of an event is the probability of the event!). The advantage of this definition is that it generalizes the usual notion of convergence in distribution to random variables with values in a metric space, or even just to measures on a metric space.
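(One standard choice of approximating sequence, in case you want to try it, though certainly not the only one: for $k \geq 1$ take the piecewise linear functions
$$f_k(t) = \begin{cases} 1, & t \leq x,\\ 1 - k(t - x), & x < t < x + 1/k,\\ 0, & t \geq x + 1/k,\end{cases}$$
which are bounded and continuous and satisfy $\chi_{(-\infty, x]} \leq f_k \leq \chi_{(-\infty, x + 1/k]}$, so $F_n(x) \leq \Bbb E f_k(X_n)$ and $\Bbb E f_k(X) \leq F(x + 1/k)$; letting $n \to \infty$ and then $k \to \infty$ gives $\limsup_n F_n(x) \leq F(x)$, and a mirror-image sequence sitting below $\chi_{(-\infty, x]}$ gives the matching $\liminf$ bound at continuity points of $F$.)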

A reference for all this, including a proof of Skorokhod representation, is Durrett (fourth edition), Chapter 3.2. The idea is simply that if $X_n$ is a sequence of random variables converging in distribution to $X$, with associated distribution functions $F_n$ and $F$ respectively, take the uniform distribution $U \sim \text{Unif}(0, 1)$ (which, in a measure-theoretic sense, is nothing but the identity map $U : ((0, 1), \mathcal{B}_{(0, 1)}, \lambda) \to \mathbf{R}$, where the domain is $(0, 1)$ with the standard Borel $\sigma$-algebra and the Lebesgue measure), and simulate $Y_n$ and $Y$ as $F_n^{\leftarrow}(U)$ and $F^{\leftarrow}(U)$, where for a distribution function $H$, $H^{\leftarrow}(x) := \sup\{y \in \mathbf{R} : H(y) < x\}$ is the generalized inverse of $H$. $Y_n$ ($n \geq 1$) and $Y$ are now defined on the same measure space $((0, 1), \mathcal{B}_{(0, 1)}, \lambda)$ with $Y_n \stackrel{d}{=} X_n$, $Y \stackrel{d}{=} X$, and indeed $Y_n \to Y$ almost everywhere (I can add more details if needed, but it's all in Durrett, Theorem 3.2.2). The idea is that one has coupled $X_n$ and $X$ so as to "straighten" the random variables pointwise without changing their distributions.
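Here is a minimal sketch of this coupling in code (the concrete distributions are my own choice: $F_n$ is the cdf of $N(1/n, 1)$ and $F$ that of $N(0, 1)$; since these cdfs are continuous and strictly increasing, the generalized inverse is just the ordinary quantile function):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Quantile coupling: feed the SAME uniform draws through each generalized
# inverse cdf.  Y_n and Y then live on a common probability space, have the
# right marginal distributions, and converge pointwise (here even uniformly).
U = rng.uniform(size=5)               # a few points of the underlying space (0, 1)
Y = norm.ppf(U)                       # Y = F^{<-}(U), distributed like X ~ N(0, 1)
for n in (1, 10, 100, 1000):
    Yn = norm.ppf(U, loc=1.0 / n)     # Y_n = F_n^{<-}(U), distributed like X_n ~ N(1/n, 1)
    print(n, np.max(np.abs(Yn - Y)))  # sup distance over these sample points -> 0
```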