Formal definition of conditional probability


It would be extremely helpful if someone could give me the formal definition of conditional probability and conditional expectation in the following setting. We are given a probability space $ (\Omega, \mathscr{A}, \mu ) $ with $\mu(\Omega) = 1 $ and a random variable $ X : \Omega \rightarrow \mathbb{R}^n $, where for any Borel set $ A \in \mathscr{B}(\mathbb{R}^n) $ we define $$ \mathbb{P}(X \in A) = (X_*\mu)(A) = \mu(X^{-1}(A))= \mu(\{\omega\in \Omega\ \ |\ \ X(\omega) \in A\})\ \ \text{and}\ \ \mathbb{E}(X) = \int_\Omega Xd\mu $$ Regardless of whether $X, Y$ are discrete or continuous (with densities $f_X, f_Y $ and joint density $f_{X,Y} $ w.r.t. some measure $\nu$ on $\mathbb{R}^n $), I am asking for the definition of $ \mathbb{P}(Y\in B\ |\ X \in A) $ and $ \mathbb{E}(Y|X) $ for all Borel sets $ A, B \in \mathscr{B}(\mathbb{R}^n) $, keeping in mind that $ \mathbb{P}(X \in A) $ may well be zero.

In our probability class something of the following sort was mentioned: if $\delta_x$ is the Dirac distribution at $ x $, then $$ \mathbb{E}(Y|X = x) = \frac{\mathbb{E}(\delta_x(X)Y)}{\mathbb{P}(X=x)}$$ out of which I can't make any sense. Any appropriate reference for these is also very much welcome.

Thank you.

There is 1 best solution below

Throughout this post, let $(\Omega,\mathcal{F},P)$ be a probability space. We first define the conditional expectation ${\rm E}[X\mid\mathcal{G}]$ for integrable random variables $X:\Omega\to\mathbb{R}$, i.e. $X\in L^1(P)$, and sub-sigma-algebras $\mathcal{G}\subseteq\mathcal{F}$.

Definition: The conditional expectation ${\rm E}[X\mid\mathcal{G}]$ of $X$ given $\mathcal{G}$ is a random variable $Z$ with the following properties:

(i) $Z$ is integrable, i.e. $Z\in L^1(P)$.

(ii) $Z$ is $(\mathcal{G},\mathcal{B}(\mathbb{R}))$-measurable.

(iii) For any $A\in\mathcal{G}$ we have $$ \int_A Z\,\mathrm dP=\int_A X\,\mathrm dP. $$

Note: It makes sense to talk about the conditional expectation since if $U$ is another random variable satisfying (i)-(iii) then $U=Z$ $P$-a.s.
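To make properties (i)-(iii) concrete, here is a minimal sketch on a finite probability space, where $\mathcal{G}$ is generated by a partition and $Z$ is simply the $P$-weighted average of $X$ on each block. The space, the random variable `X`, the partition, and the helper names `cond_exp` and `integral` are all illustrative assumptions, not part of the definition; the sketch also assumes every block has positive probability.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space: Omega = {0,...,5} with uniform P.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}

# An arbitrary (trivially integrable) random variable, here X(w) = w^2.
X = {w: Fraction(w * w) for w in omega}

# G is the sigma-algebra generated by this partition of Omega.
partition = [{0, 1}, {2, 3, 4}, {5}]


def cond_exp(X, P, partition):
    """Return Z = E[X | G]: on each block, the P-weighted average of X."""
    Z = {}
    for block in partition:
        mass = sum(P[w] for w in block)  # assumed > 0
        avg = sum(X[w] * P[w] for w in block) / mass
        for w in block:
            Z[w] = avg  # constant on blocks, hence G-measurable
    return Z


def integral(f, P, A):
    """Integral of f over the event A with respect to P."""
    return sum(f[w] * P[w] for w in A)


Z = cond_exp(X, P, partition)

# The sets in G are exactly the unions of blocks (including the empty union).
events = [set().union(*c)
          for r in range(len(partition) + 1)
          for c in combinations(partition, r)]

# Property (iii): the integrals of Z and X agree on every set in G.
property_iii = all(integral(Z, P, A) == integral(X, P, A) for A in events)
```

Because `Z` is constant on each block, it is $\mathcal{G}$-measurable by construction, and `property_iii` checks the defining integral identity on all eight sets of $\mathcal{G}$.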

Definition: If $X\in L^1(P)$ and $Y:\Omega\to\mathbb{R}$ is any random variable, then the conditional expectation of $X$ given $Y$ is defined as $$ {\rm E}[X\mid Y]:={\rm E}[X\mid\sigma(Y)], $$ where $\sigma(Y)=\{Y^{-1}(B)\mid B\in\mathcal{B}(\mathbb{R})\}$ is the sigma-algebra generated by $Y$.
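On a finite sample space, $\sigma(Y)$ can be listed explicitly: it consists of all unions of level sets $\{Y=y\}$. The following sketch illustrates this; the sample space, the map `Y`, and the helper `generated_sigma_algebra` are hypothetical choices made only for the example.

```python
from itertools import combinations

# Hypothetical finite sample space and random variable.
omega = [0, 1, 2, 3, 4, 5]


def Y(w):
    return w % 3


def generated_sigma_algebra(omega, Y):
    """Return sigma(Y) as a set of frozensets: all unions of level sets of Y."""
    levels = {}
    for w in omega:
        levels.setdefault(Y(w), set()).add(w)
    atoms = list(levels.values())  # the level sets {Y = y}
    return {frozenset().union(*c)
            for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)}


sigma_Y = generated_sigma_algebra(omega, Y)
```

Here the level sets are $\{0,3\}$, $\{1,4\}$ and $\{2,5\}$, so $\sigma(Y)$ has $2^3$ elements, and a set such as $\{0,1\}$ is not in it because $Y$ cannot distinguish $0$ from $3$.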

I'm not aware of any other definition of $P(Y\in B\mid X\in A)$ than the obvious one, i.e. $$ P(Y\in B\mid X\in A)=\frac{P(Y\in B,X\in A)}{P(X\in A)} $$ provided that $P(X\in A)>0$. The only exception is when $A$ consists of a single point, i.e. $A=\{x\}$ for some $x\in\mathbb{R}$. In this case, the object $P(Y\in B\mid X=x)$ is defined in terms of a regular conditional distribution, which makes sense even when $P(X=x)=0$.
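The elementary ratio formula can be sketched for a discrete example as follows. The two-dice setup and the helper names `P` and `cond_prob` are assumptions chosen for illustration; the function deliberately refuses to return a value when $P(X\in A)=0$, which is exactly the case the ratio does not cover.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair dice; X is the first die, Y is the sum.
outcomes = list(product(range(1, 7), repeat=2))
prob = Fraction(1, len(outcomes))


def X(w):
    return w[0]


def Y(w):
    return w[0] + w[1]


def P(event):
    """Probability of the event given as a predicate on outcomes."""
    return sum(prob for w in outcomes if event(w))


def cond_prob(B, A):
    """Elementary P(Y in B | X in A); only defined when P(X in A) > 0."""
    p_A = P(lambda w: X(w) in A)
    if p_A == 0:
        raise ValueError("P(X in A) = 0: the elementary formula does not apply")
    return P(lambda w: Y(w) in B and X(w) in A) / p_A


p = cond_prob(B={7}, A={1, 2})  # P(sum = 7 | first die is 1 or 2)
```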

Let us first define regular conditional probabilities. Let $X:\Omega\to\mathbb{R}$ be a random variable.

Definition: A regular conditional probability for $P$ given $X$ is a function $$ \mathcal{F}\times \mathbb{R} \ni(A,x)\mapsto P^X(A\mid x) $$ satisfying the following three conditions:

(i) The mapping $A\mapsto P^X(A\mid x)$ is a probability measure on $(\Omega,\mathcal{F})$ for all $x\in \mathbb{R}$.

(ii) The mapping $x\mapsto P^X(A\mid x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable for all $A\in\mathcal{F}$.

(iii) The defining equation holds: For any $A\in\mathcal{F}$ and $B\in\mathcal{B}(\mathbb{R})$ we have $$ \int_B P^X(A\mid x)\,P_X(\mathrm dx)=P(A\cap\{X\in B\}). $$

Note: A mapping satisfying (i) and (ii) is often called a Markov kernel. Furthermore, since $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ is a nice space, the regular conditional probability is unique in the sense that if $\tilde{P}^X(\cdot\mid\cdot)$ is another regular conditional probability of $P$ given $X$, then we have that $P^X(\cdot\mid x)=\tilde{P}^X(\cdot\mid x)$ for $P_X$-a.a. $x$. Here $P_X=P\circ X^{-1}$ is the distribution of $X$.

Connection: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Then for any $A\in\mathcal{F}$ we have $$ {\rm E}[1_A\mid X]=\varphi(X), $$ where $\varphi(x)=P^X(A\mid x)$. In short we write ${\rm E}[1_A\mid X]=P^X(A\mid X)$.

Now let us introduce another random variable $Y:\Omega\to\mathbb{R}$, and $P^X(\cdot\mid \cdot)$ still denotes a regular conditional probability of $P$ given $X$.

Definition: For $B\in\mathcal{B}(\mathbb{R})$ and $x\in\mathbb{R}$ we define the regular conditional distribution of $Y$ given $X$ by $$ P_{Y\mid X}(B\mid x):=P^X(Y\in B\mid x). $$

Instead of $P_{Y\mid X}(B\mid x)$ one often writes $P(Y\in B\mid X=x)$.

An easy consequence of this definition is that $(B,x)\mapsto P_{Y\mid X}(B\mid x)$ is a Markov kernel and for any $A,B\in\mathcal{B}(\mathbb{R})$ we have $$ \int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=P(\{X\in A\}\cap\{Y\in B\}). \tag{1} $$

In fact, $P_{Y\mid X}(\cdot \mid \cdot)$ is a regular conditional distribution of $Y$ given $X$ if and only if $P_{Y\mid X}(\cdot\mid\cdot)$ is a Markov kernel and satisfies $(1)$. Again $(1)$ is often referred to as the defining equation.
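In the discrete case the defining equation $(1)$ can be checked by brute force. The sketch below assumes a small made-up joint pmf for $(X,Y)$ and builds the kernel as $P_{Y\mid X}(B\mid x)=P(X=x,Y\in B)/P(X=x)$, which is only available here because every $x$ in the support has positive mass; the names `joint`, `kernel`, `lhs` and `rhs` are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical joint pmf of (X, Y) on {0, 1} x {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Distribution P_X of X (the marginal pmf).
p_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}


def kernel(B, x):
    """P_{Y|X}(B | x) for x with p_X(x) > 0."""
    return sum(joint[(x, y)] for y in ys if y in B) / p_X[x]


def subsets(values):
    """All subsets of a finite list of values."""
    return [set(c) for c in chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1))]


def lhs(A, B):
    """Integral over A of P_{Y|X}(B | x) with respect to P_X."""
    return sum(kernel(B, x) * p_X[x] for x in A)


def rhs(A, B):
    """P({X in A} and {Y in B})."""
    return sum(p for (x, y), p in joint.items() if x in A and y in B)


equation_holds = all(lhs(A, B) == rhs(A, B)
                     for A in subsets(xs) for B in subsets(ys))
```

For each fixed `x`, `kernel(., x)` is a probability measure on the values of $Y$, so in this finite setting the Markov kernel property is immediate and only $(1)$ needs checking.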

Definition: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Furthermore, let $U:\Omega\to\mathbb{R}$ be another random variable that is assumed bounded (to ensure the following expectations exist). Then we define the (regular) conditional mean of $U$ given $X=x$ by $$ {\rm E}[U\mid X=x]:=\int_\Omega U(\omega)\, P^X(\mathrm d\omega\mid x). $$

Let us denote $\psi(x)={\rm E}[U\mid X=x]$. Then we have the following:

Connection: The mapping $\mathbb{R}\ni x\mapsto \psi(x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable, and $$ {\rm E}[U\mid X]=\psi(X). $$
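As a sanity check of this connection in a discrete setting, the sketch below takes $\Omega$ to be the set of pairs $(x,y)$ with a made-up pmf, lets $X$ be the first coordinate, computes $\psi(x)$ by integrating $U$ against $P^X(\cdot\mid x)$, and then verifies that $\psi(X)$ satisfies property (iii) of the conditional expectation on $\sigma(X)$. The pmf, the choice of `U`, and the function names are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical pmf on Omega = {0, 1} x {0, 1, 2}; X(omega) is the first coordinate.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}


def reg_cond_prob(omega, x):
    """P^X({omega} | x), assuming P(X = x) > 0."""
    p_x = sum(p for (x2, _), p in joint.items() if x2 == x)
    return joint[omega] / p_x if omega[0] == x else Fraction(0)


def U(omega):
    """A bounded random variable on Omega (arbitrary choice)."""
    x, y = omega
    return x + y * y


def psi(x):
    """E[U | X = x]: the integral of U against P^X(. | x)."""
    return sum(U(w) * reg_cond_prob(w, x) for w in joint)


# Property (iii) on sigma(X): the integrals of psi(X) and U agree on {X in A}.
property_iii = all(
    sum(psi(w[0]) * p for w, p in joint.items() if w[0] in A)
    == sum(U(w) * p for w, p in joint.items() if w[0] in A)
    for A in [set(), {0}, {1}, {0, 1}]
)
```

Taking $A=\{0,1\}$ in the check gives ${\rm E}[\psi(X)]={\rm E}[U]$, i.e. the tower property in its simplest form.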

The following is an extremely useful rule when calculating with conditional distributions:

Rule: Let $X$ and $Y$ be as above, and let $\xi:\mathbb{R}^2\to\mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^2),\mathcal{B}(\mathbb{R}))$-measurable. Then $$ P(\xi(X,Y)\in D\mid X=x)=P(\xi(x,Y)\in D\mid X=x),\quad D\in\mathcal{B}(\mathbb{R}), $$ holds for $P_X$-a.a. $x$. This is saying that "conditional on $X=x$ we may replace $X$ by $x$".

The following example shows how this rule can be useful: Let $X$ and $Y$ be independent $\mathcal{N}(0,1)$ random variables, and let $U=X+Y$. Then we claim that $U\mid X=x\sim \mathcal{N}(x,1)$ for $P_X$-a.a. $x$. To see this, note that by the rule above, the distributions of $U\mid X=x$ and $Y+x\mid X=x$ are the same. But since $Y$ is independent of $X$, the distribution of $Y+x\mid X=x$ is that of $Y+x$. We can write it as follows: $$ U\mid X=x\sim Y+x\mid X=x\sim Y+x\sim\mathcal{N}(x,1). $$
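The claim $U\mid X=x\sim\mathcal{N}(x,1)$ can be checked numerically, with one caveat: the event $\{X=x\}$ has probability zero, so a simulation can only condition on a small window $|X-x|<h$. The sketch below does that as an approximation; the window half-width, sample size, seed, and the function name `conditional_sample` are arbitrary assumptions.

```python
import random
import statistics


def conditional_sample(x, half_width=0.05, n=200_000, seed=0):
    """Draw (X, Y) i.i.d. N(0,1) and keep U = X + Y whenever |X - x| < half_width.

    This only approximates conditioning on {X = x}, which is a null event.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        X = rng.gauss(0.0, 1.0)
        Y = rng.gauss(0.0, 1.0)
        if abs(X - x) < half_width:
            kept.append(X + Y)
    return kept


sample = conditional_sample(x=0.5)
mean_u = statistics.fmean(sample)     # should be close to x = 0.5
var_u = statistics.variance(sample)   # should be close to 1
```

Shrinking `half_width` brings the windowed conditional law closer to $\mathcal{N}(x,1)$ but leaves fewer retained samples, so the two parameters have to be balanced.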