Conditional probability on zero-probability events (Application)


This question can be regarded as an application of this question.

Let $(\Omega ,{\mathcal {F}},P)$ be a probability space, let $Z_1$ be an $(M_1,{\mathcal {M}}_1)$-valued random variable (that is, it is $(\mathcal{F},\mathcal{M}_1)$-measurable), let $Z_2$ be an $(M_2,{\mathcal {M}}_2)$-valued random variable, let $W$ be an $(N,{\mathcal {N}})$-valued random variable, and let $g$ be a measurable function. Assume that $Z_1$ and $Z_2$ are independent given any realization $w$ of $W$ (for example, $Z_1 = W + X_1$ and $Z_2 = W + X_2$, where $X_1$ and $X_2$ are independent of each other and of $W$).
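A quick sketch of why the example satisfies the assumption, assuming additionally that $X_1$, $X_2$, and $W$ are real-valued and mutually independent, and arguing heuristically as if $P(W=w)>0$: writing $A-w:=\{a-w : a\in A\}$, independence gives $$ P(Z_1 \in A_1, Z_2 \in A_2 \mid W=w) = P(X_1 \in A_1 - w)\,P(X_2 \in A_2 - w) = P(Z_1 \in A_1 \mid W=w)\,P(Z_2 \in A_2 \mid W=w). $$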

How can we use the definition to show that $P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W=w) = P(Z_1 \in A_1 \vert g(Z_1) \in B, W=w)P(Z_2 \in A_2 \vert W=w)?$

It is quite intuitive that this should be true, and I know that if $W$ is a discrete random variable, we can show it using Bayes's rule. But because $P(W=w)$ can be zero, I don't know how to proceed in the general case.
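For the record, here is the discrete computation I have in mind, assuming $P(g(Z_1) \in B, W=w) > 0$; the second equality applies the conditional independence assumption to the set $A_1 \cap g^{-1}(B)$, since $\{Z_1 \in A_1, g(Z_1) \in B\} = \{Z_1 \in A_1 \cap g^{-1}(B)\}$: \begin{align} &P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W=w) = \dfrac{P(Z_1 \in A_1, g(Z_1) \in B, Z_2 \in A_2 \vert W=w)}{P(g(Z_1) \in B \vert W=w)} \\ &= \dfrac{P(Z_1 \in A_1, g(Z_1) \in B \vert W=w)\,P(Z_2 \in A_2 \vert W=w)}{P(g(Z_1) \in B \vert W=w)} = P(Z_1 \in A_1 \vert g(Z_1) \in B, W=w)\,P(Z_2 \in A_2 \vert W=w). \end{align}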

My solution: According to here, if $\mathcal{N}$ is a Borel sigma-algebra, and $g$ is Borel measurable, then: \begin{align} &P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W=w)= \\ &\lim_{r \downarrow 0} P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W \in (w-r,w+r)) = \\ &\lim_{r \downarrow 0} \dfrac{P(Z_1 \in A_1, Z_2 \in A_2, g(Z_1) \in B \vert W \in (w-r,w+r))}{P(g(Z_1) \in B \vert W \in (w-r,w+r))} = \\ &\lim_{r \downarrow 0} \dfrac{P(Z_1 \in A_1, g(Z_1) \in B \vert W \in (w-r,w+r))P(Z_2 \in A_2 \vert W \in (w-r,w+r))}{P(g(Z_1) \in B \vert W \in (w-r,w+r))} = \\ &\lim_{r \downarrow 0} P(Z_1 \in A_1 \vert g(Z_1) \in B, W \in (w-r,w+r))P(Z_2 \in A_2 \vert W \in (w-r,w+r))= \\ & P(Z_1 \in A_1 \vert g(Z_1) \in B, W=w)P(Z_2 \in A_2 \vert W=w). \end{align} Is this correct if $\mathcal{N}$ is a Borel sigma-algebra and $g$ is Borel measurable?

Best answer:

Here's a short and sweet derivation of the equation you are interested in, followed by a detailed explanation.

$$ \begin{align} P\left(Z_1 \in A_1, Z_2 \in A_2\ |\ \mathbb{1}_{\{g(Z_1) \in B\}},\ W\right) &= E\left(\mathbb{1}_{A_1}(Z_1)\mathbb{1}_{A_2}(Z_2)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) \\ &= E\left(E\left(\mathbb{1}_{A_1}(Z_1)\mathbb{1}_{A_2}(Z_2)\ |\ Z_1, W\right)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right)\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ E\left(\mathbb{1}_{A_2}(Z_2)\ |\ Z_1, W\right)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right)\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ E\left(\mathbb{1}_{A_2}(Z_2)\ |\ W\right)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right)\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) E\left(\mathbb{1}_{A_2}(Z_2)\ |\ W\right)\\ &= P\left(Z_1 \in A_1\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) P\left(Z_2 \in A_2\ |\ W\right). \end{align} $$


OK, so regarding that question of yours, the most important thing for you to take from my answer to it is not the formula $$ P(X\in A\ |\ Y=y) = \lim_{\Delta y\downarrow 0}\frac{P\left(X\in A,\ y-\Delta y < Y < y+\Delta y\right)}{P\left(y-\Delta y < Y < y+\Delta y\right)}, $$ but rather what I stated in the beginning as fact #2:

the function $f:\mathbb{R}\rightarrow\mathbb{R}$ obtained by setting $f(y) := \lim_{\Delta y \downarrow 0} P(X \in A\ |\ Y \in (y-\Delta y, y+\Delta y))$ wherever possible, and, say, $f(y):=0$ elsewhere, is consistent with the traditional measure-theoretic definition of $P(X\in A\ |\ Y=y)$.

This means that, even though the traditional measure-theoretic definition of conditional probability is inscrutable, you can rest assured that it is the same thing as the intuitive limit-oriented definition that you proposed, at least as long as the conditioning variables (i.e. the ones on the right-hand side of the $|$ symbol) take values in the Borel space $\mathbb{R}^n$. If they don't, the definitions may or may not coincide; I simply don't know at the present time, but I wouldn't bet a large sum of money on it.

A compelling reason to use the traditional definition is that there are a whole bunch of useful facts that have been discovered about it, and we can use these facts out of the box to help us figure out complex problems involving conditional probabilities, like the one you posed in your question, and write short and sweet proofs, like the one I opened this answer with.

In fact, there are so many useful facts like these that you can often use them to prove stuff about conditional probabilities without once referring to the definition of conditional probability, which is awesome, because the definition is complicated but many of the facts are easy to state and remember. As a case in point, in this answer I'm not going to use the definition once; just various facts about it.

Another important, but delicate, point to note, before we delve into answering your question, is that the expression $P(X\in A\ |\ Y=y)$ is not technically a conditional probability, but rather a conditional distribution, evaluated at $A$. The difference between a conditional probability and a conditional distribution is that a conditional distribution is a function over $Y$'s range, whereas a conditional probability is a function over $Y$'s domain. So, if $Y$ is a random variable on the measurable space $\Omega$ that takes values in the measurable space $E$, then a conditional distribution is a function of the form $$ y_{\in E} \mapsto P(A\ |\ Y=y), $$ whereas a conditional probability is a function of the form $$ \omega_{\in\Omega} \mapsto P(A\ |\ Y)(\omega). $$ We don't usually specify the $\omega$ when dealing with conditional probabilities; we simply write $P(A\ |\ Y)$.

There is a very simple relation between the conditional distribution $P(A\ |\ Y=y)$ and the corresponding conditional probability $P(A\ |\ Y)$. Writing $f(y):=P(A\ |\ Y=y)$, the relation is $$ P(A\ |\ Y) = f(Y). $$
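To make this concrete in the simplest possible setting (a discrete illustration of my own, not something taken from your question): if $Y$ takes countably many values, each with positive probability, then $$ f(y) = \frac{P(A \cap \{Y=y\})}{P(Y=y)}, \qquad P(A\ |\ Y)(\omega) = f\bigl(Y(\omega)\bigr), $$ i.e. the conditional probability is the random variable obtained by plugging $Y$ itself into the elementary conditional-probability formula.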

It is very easy to translate any statement that is couched in the language of conditional distributions to an equivalent statement about conditional probabilities: formally, it is just a matter of dropping the argument "y", so: $P(A\ |\ Y=y)$ (a conditional distribution) becomes $P(A\ |\ Y)$ (a conditional probability). All the useful facts that I mentioned above are stated in terms of conditional probabilities, so we want to make sure we translate any problem involving conditional distributions to one involving conditional probabilities before we start shifting the pieces around.

So, finally, let's address your question. The first tricky thing to deal with is that the expression $$ P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W=w) $$ doesn't look like either a conditional probability or a conditional distribution, since the conditioning part consists of an event, $\{g(Z_1) \in B\}$, as well as a random variable, $W$, and when we deal with conditional probabilities, we want the conditioning part to be homogeneous: either all events, or all random variables. In the former case, we're in the realm of discrete probability. In the latter case, we're in the realm of general (i.e., measure-theoretic) probability.

Fortunately for us, there's nothing easier than rephrasing an event, $A$, in terms of a random variable. Just write $$ A = \{\mathbb{1}_A = 1\}, $$ where $\mathbb{1}_A$ is the indicator function of the event $A$ w.r.t. the sample space $\Omega$.

So the first thing we do is rewrite the expression $$ P(Z_1 \in A_1, Z_2 \in A_2 \vert g(Z_1) \in B, W=w) $$ as $$ P\left(Z_1 \in A_1, Z_2 \in A_2\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}=1,\ W=w\right). $$ Now that we have an expression that looks like a conditional distribution, we drop the arguments on the right-hand side of the $|$ symbol, to obtain the conditional probability $$ P\left(Z_1 \in A_1, Z_2 \in A_2\ |\ \mathbb{1}_{\{g(Z_1) \in B\}},\ W\right). $$ We do the same with the other expressions appearing in the equation we wish to prove, to obtain $$ P\left(Z_1 \in A_1, Z_2 \in A_2\ |\ \mathbb{1}_{\{g(Z_1) \in B\}},\ W\right) = P\left(Z_1 \in A_1\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) P\left(Z_2 \in A_2\ |\ W\right). $$ This is what we are going to prove. Incidentally, when dealing with conditional probabilities, every "$=$" sign must be understood as "$=$ almost surely".

Now, remember I said that there are a whole bunch of facts about conditional probabilities that we can use to write proofs succinctly and elegantly? I lied! Actually, all these facts are about conditional expectations, so, if we want to be able to tap into this cornucopia of facts, we need to rewrite the problem in terms of conditional expectations. How do we do it? Easy-peasy! Remember that, for every event $A$, we have $$ P(A) = E(\mathbb{1}_A)? $$ (If you don't remember, try to prove it. It's a discrete-probability fact, and you're an expert in discrete probability if you've gotten this far in your studies.) The analogous equality involving conditionals is valid too: $$ P(A\ |\ Y) = E(\mathbb{1}_A\ |\ Y). $$ Just take my word for it, as you must take my word for all the other facts involving conditional expectation that I'm about to use. But write down all these facts, so you can use them later in other problems. In fact, why bother writing them down, when there's an entire Wikipedia section listing the most useful properties of conditional expectation? We will refer to this list in what follows.

An important consequence of the last equation, one which we will have occasion to refer to later, is that $E(\mathbb{1}_A\ |\ Y)$ is a function of $Y$. Why? Because $P(A\ |\ Y)$ is a function of $Y$. Why? Because, as we mentioned above, $P(A\ |\ Y) = f(Y)$ where $f$ is the conditional distribution $P(A\ |\ Y=y)$. It is possible to generalize the fact that $E(\mathbb{1}_A\ |\ Y)$ is a function of $Y$ to show that $E(X\ |\ Y)$ is a function of $Y$ for any random variable $X$. (Again, you'll have to trust me on this.) Remember this fact! We will use it in the very last step of the solution to your question.
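For reference, the general result lurking behind this is the Doob–Dynkin lemma, which I state here without proof, for a real-valued random variable $U$ and a random variable $Y$ taking values in a measurable space $(E, \mathcal{E})$: $$ U \text{ is } \sigma(Y)\text{-measurable} \iff U = h(Y) \text{ for some measurable } h : E \to \mathbb{R}. $$ Since $E(X\ |\ Y)$ is $\sigma(Y)$-measurable by construction, it must be of the form $h(Y)$.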

Now that we've spent all this time talking about conditional distributions, I'd like to come clean: I haven't used this terminology quite precisely. When professional mathematicians talk about conditional distributions, they mean something a little different from the concept I dubbed as such above. But I needed a word to distinguish between $P(A\ |\ Y=y)$ as a function on $Y$'s range and $P(A\ |\ Y)$ as a function on $Y$'s domain, and "conditional distribution" seemed good enough. I'm not aware of any standard terminology for distinguishing between these concepts.

Alright, so now that we've transformed all the conditional probabilities into conditional expectations, we're down to proving the following claim: $$ E\left(\mathbb{1}_{A_1}(Z_1)\mathbb{1}_{A_2}(Z_2)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) = E\left(\mathbb{1}_{A_1}(Z_1)\ |\ \mathbb{1}_{\{g(Z_1) \in B\}}, W\right) E\left(\mathbb{1}_{A_2}(Z_2)\ |\ W\right). $$ (Again, the equation is only true almost surely, but you've already got that, right? So from now on I won't point this out every time. Just keep it in the back of your mind.)

Here's the rest of the proof. I'll write it down in one fell swoop and then go back and explain every step. Setting $D:=g^{-1}(B)$, so that $\{g(Z_1) \in B\} = \{Z_1 \in D\}$, we have $$ \begin{align} E\left(\mathbb{1}_{A_1}(Z_1)\mathbb{1}_{A_2}(Z_2)\ |\ \mathbb{1}_D(Z_1), W\right) &= E\left(E\left(\mathbb{1}_{A_1}(Z_1)\mathbb{1}_{A_2}(Z_2)\ |\ Z_1, W\right)\ |\ \mathbb{1}_D(Z_1), W\right) \tag{1}\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ E\left(\mathbb{1}_{A_2}(Z_2)\ |\ Z_1, W\right)\ |\ \mathbb{1}_D(Z_1), W\right) \tag{2}\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ E\left(\mathbb{1}_{A_2}(Z_2)\ |\ W\right)\ |\ \mathbb{1}_D(Z_1), W\right) \tag{3}\\ &= E\left(\mathbb{1}_{A_1}(Z_1)\ |\ \mathbb{1}_D(Z_1), W\right) E\left(\mathbb{1}_{A_2}(Z_2)\ |\ W\right). \tag{4} \end{align} $$

Referring to the Wikipedia list,

  • equation (1) is due to the tower property, since $(\mathbb{1}_D(Z_1), W)$ is a function of $(Z_1, W)$;

  • equation (2) is due to the "pulling out known factors" property;

  • equation (3) is due to Doob's conditional independence property, seeing as it was stated in your question that $Z_1$ and $Z_2$ are independent given $W$;

  • equation (4) is again due to the "pulling out known factors" property, since $E(\mathbb{1}_{A_2}(Z_2)|W)$ is a function of $W$. (Remember we promised we'd use this fact in the last step of the proof? We've come through.)

... and we're done!
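By the way, if you'd like an empirical sanity check of the final identity (this is not part of the proof, and all the specific distributions and sets below are my own illustrative choices), here is a minimal Monte Carlo sketch in which $W$ is discrete, so every conditional probability can be estimated by plain counting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative model: W discrete, Z1 = W + X1, Z2 = W + X2, with
# W, X1, X2 mutually independent, so Z1 and Z2 are conditionally
# independent given W.
W = rng.integers(0, 3, size=n)   # W uniform on {0, 1, 2}
Z1 = W + rng.normal(size=n)      # Z1 = W + X1
Z2 = W + rng.normal(size=n)      # Z2 = W + X2

w = 1                            # realization of W to condition on
A1 = Z1 > 0.5                    # the event {Z1 in A1}
A2 = Z2 > 1.0                    # the event {Z2 in A2}
B = np.abs(Z1) < 2.0             # the event {g(Z1) in B}, with g = |.|

cond = B & (W == w)              # the conditioning event {g(Z1) in B, W = w}

# P(Z1 in A1, Z2 in A2 | g(Z1) in B, W = w)
lhs = np.mean(A1[cond] & A2[cond])
# P(Z1 in A1 | g(Z1) in B, W = w) * P(Z2 in A2 | W = w)
rhs = np.mean(A1[cond]) * np.mean(A2[W == w])

print(lhs, rhs)  # the two estimates should agree up to Monte Carlo noise
```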