What does the decomposition, weak union and contraction rule mean for conditional probability and what are their proofs?


I was reading Koller's book on Probabilistic Graphical Models and was wondering what the decomposition, weak union and contraction properties of conditional probability mean.

But before I ask exactly what I am confused about, let me introduce some of Koller's notation so that we are all on the same page (if anything else is unclear, feel free to ask in the comments). Capital non-bold letters stand for random variables, say $X$ is a r.v. Lowercase non-bold letters stand for assignments to a random variable, say $(X = x)$. Capital bold letters denote sets of random variables; for example, $\textbf{X}, \textbf{Y}, \textbf{Z}$ are three sets of random variables. Lowercase bold letters $\textbf{x}, \textbf{y}, \textbf{z}$ denote assignments of values to the variables in these sets. Let $Val(\textbf{X})$ be the set of values that the set of random variables $\textbf{X}$ can take.

Now, these are the properties/theorems I am trying to prove and understand:

Decomposition:

$$( \textbf{X} \perp \textbf{Y}, \textbf{W} \ | \ \textbf{Z}) \implies (\textbf{X} \perp \textbf{Y} \ | \ \textbf{Z})$$

Weak union:

$$(\textbf{X} \perp \textbf{Y}, \textbf{W} \ | \ \textbf{Z}) \implies (\textbf{X} \perp \textbf{Y} \ | \ \textbf{Z}, \textbf{W})$$

Contraction:

$$(\textbf{X} \perp \textbf{W} \ | \ \textbf{Z}, \textbf{Y}) \ \& \ (\textbf{X} \perp \textbf{Y} \ | \ \textbf{Z}) \implies (\textbf{X} \perp \textbf{Y}, \textbf{W} \ | \ \textbf{Z})$$

Intersection:

$$(\textbf{X} \perp \textbf{Y} | \textbf{Z}, \textbf{W}) \ \& \ (\textbf{X} \perp \textbf{W} | \textbf{Z}, \textbf{Y}) \implies (\textbf{X} \perp \textbf{Y}, \textbf{W} | \textbf{Z})$$

Let's take the first statement. I think my issue is notational (though I'm not sure). Does the first statement mean "$\textbf{X}$ is conditionally independent of $\textbf{Y}$ and $\textbf{W}$ given $\textbf{Z}$"? i.e. is the first statement the same as $(\textbf{X} \perp (\textbf{Y},\textbf{W}) \mid \textbf{Z})$? If that is true, then would the first statement imply two things:

$$(\textbf{X} \perp \textbf{Y}, \textbf{W} \ | \ \textbf{Z}) \implies (\textbf{X} \perp \textbf{Y} \ | \ \textbf{Z})$$

and

$$(\textbf{X} \perp \textbf{Y}, \textbf{W} \ | \ \textbf{Z}) \implies (\textbf{X} \perp \textbf{W} \ | \ \textbf{Z})$$

?

I actually intended to prove these as an exercise for myself; however, not being sure whether I understood the notation made it hard to even attempt a proof (and also hard to grasp the intuition behind each statement, which I wanted to do as well). Providing one proof as an example and explaining the notation might be enough for me to attempt the others.
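For concreteness, here is a quick numerical sanity check of this reading (my own sketch, not from Koller): build a discrete joint distribution in which $(\textbf{X} \perp \textbf{Y}, \textbf{W} \mid \textbf{Z})$ holds by construction, and verify that both implied statements hold. The brute-force `is_ci` helper is an assumption of this sketch.

```python
import numpy as np

def is_ci(joint, A, B, C):
    """Brute-force check of (A ⊥ B | C) for a discrete joint distribution:
    P(a,b,c)·P(c) == P(a,c)·P(b,c) must hold in every cell (this form avoids
    dividing by a possibly-zero P(c)).  A, B, C are disjoint tuples of axes."""
    other = lambda keep: tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marg = lambda keep: joint.sum(axis=other(keep), keepdims=True)
    return np.allclose(marg(A + B + C) * marg(C), marg(A + C) * marg(B + C))

rng = np.random.default_rng(0)

# Joint over binary X, Y, W, Z (axes 0,1,2,3), built so that the premise
# (X ⊥ Y,W | Z) holds by construction: P(x,y,w,z) = P(z)·P(x|z)·P(y,w|z).
pz = np.array([0.4, 0.6])
px_z = rng.random((2, 2)); px_z /= px_z.sum(axis=0)              # P(x|z)
pyw_z = rng.random((2, 2, 2)); pyw_z /= pyw_z.sum(axis=(0, 1))   # P(y,w|z)
joint = np.einsum('z,xz,ywz->xywz', pz, px_z, pyw_z)

assert is_ci(joint, (0,), (1, 2), (3,))  # premise: X ⊥ (Y,W) | Z
assert is_ci(joint, (0,), (1,), (3,))    # implied: X ⊥ Y | Z
assert is_ci(joint, (0,), (2,), (3,))    # implied: X ⊥ W | Z
```

Of course one example proves nothing; it only supports the interpretation of the notation.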


Bounty Section

Now that I have tried to prove them, I have had more difficulties than I expected. I would like to see a proof for each one, as I was unable to find them on the internet.

I tried proving weak union and this is what I have so far:

We want an expression for $(\textbf{X} \perp \textbf{Y} \mid \textbf{Z}, \textbf{W} )$ using the property $(\textbf{X} \perp \textbf{Y}, \textbf{W} \mid \textbf{Z} )$, so let's consider:

$P(\textbf{X},\textbf{Y} | \textbf{Z}, \textbf{W}) = \frac{P(\textbf{X} ,\textbf{Y} , \textbf{Z} , \textbf{W} )}{P(\textbf{Z} , \textbf{W})} = \frac{P(\textbf{Z})P(\textbf{X}, \textbf{Y} , \textbf{W} \mid \textbf{Z})}{P(\textbf{Z} , \textbf{W})}$

Now we can use the property we need by noting that $P(\textbf{X}, \textbf{Y} ,\textbf{W} \mid \textbf{Z}) = P( \textbf{X} | \textbf{Z} )P(\textbf{Y} , \textbf{W} \mid \textbf{Z})$ (due to assumption/property $(\textbf{X} \perp \textbf{Y}, \textbf{W} \mid \textbf{Z})$), thus:

$\frac{P(\textbf{Z})P(\textbf{X},\textbf{Y},\textbf{W} \mid \textbf{Z} )}{P(\textbf{Z}, \textbf{W})} = \frac{P(\textbf{Z})P(\textbf{X}|\textbf{Z})P(\textbf{Y} , \textbf{W} | \textbf{Z} )}{P(\textbf{Z} , \textbf{W})}$

and that was as far as I got for Weak Union. I couldn't get it to be the same as:

$P(\textbf{X},\textbf{Y}|\textbf{Z},\textbf{W}) = P(\textbf{X}|\textbf{Z},\textbf{W})P(\textbf{Y}|\textbf{Z},\textbf{W})$


There are 7 answers below.

Answer (7 votes):

Your interpretation of the statement is correct, and I actually prefer "your" notation $X\perp\!\!\!\perp (Y,W)\mid Z$ rather than the one without the parenthesis.

So what does $X\perp\!\!\!\perp Y\mid Z$ even mean? One possible way of defining this, is to require that $$ P(X\in A,Y\in B\mid Z)=P(X\in A\mid Z)P(Y\in B\mid Z) $$ should hold for any (measurable) $A,B\subseteq\mathbb{R}$. An equivalent definition is that $$ {\rm E}[f(X)g(Y)\mid Z]={\rm E}[f(X)\mid Z]{\rm E}[g(Y)\mid Z] $$ should hold for all bounded (measurable) $f,g:\mathbb{R}\to\mathbb{R}$.

As for the "decomposition" statement, suppose $X\perp\!\!\!\perp (Y,W)\mid Z$ and let $A,B\subseteq\mathbb{R}$. Then $$ P(X\in A,Y\in B\mid Z)=P(X\in A,Y\in B,W\in \mathbb{R}\mid Z) $$ since $P(W\in \mathbb{R})=1$. Using the conditional independence assumption, this equals $$ \begin{align} P(X\in A,(Y,W)\in B\times\mathbb{R}\mid Z)&=P(X\in A\mid Z)P((Y,W)\in B\times\mathbb{R}\mid Z)\\ &=P(X\in A\mid Z)P(Y\in B\mid Z) \end{align} $$ and hence $X\perp\!\!\!\perp Y\mid Z$.

Answer (5 votes):

For two random variables, I take $X \perp Y$ to mean $\mathrm{Cov}[f(X),g(Y)] = \mathbb{E}[f(X)g(Y)] - \mathbb{E}[f(X)]\mathbb{E}[g(Y)]= 0$ for all $f,g \in L^2(\mu)$, where $\mu$ is the probability measure. This can be thought of as "orthogonality" in the space of random variables as the source of randomness $\omega$ varies over the sample space $\Omega$.


Decomposition

If $X \cap (Y\cup W) \cap \overline{Z} = \varnothing $ then $X \cap Y \cap \overline{Z} = \varnothing $

Weak Union

If $X \cap (Y\cup W) \cap \overline{Z} = \varnothing $ then by distributivity

$$ X \cap (Y\cup W) \cap \overline{Z} = (X \cap Y \cap \overline{Z})\cup (X \cap W \cap \overline{Z}) = \varnothing $$

Then $X \cap Y \cap \overline{Z} \cap \overline{W}= (X \cap Y \cap \overline{Z}) \cap \overline{W} = \varnothing $

Contraction

This was mistyped in your question.

$X \cap W \cap \overline{ Y} \cap \overline{Z} = \varnothing$ and $X \cap Y \cap \overline{Z} = \varnothing$ then

$$ X \cap (Y \cup W) \cap \overline{Z} = ( X \cap Y \cap \overline{Z}) \cup ( X \cap W \cap \overline{Z}) = \varnothing $$

Intersection

$ X \cap Y \cap \overline{Z} \cap \overline{W}= \varnothing $ and $ X \cap W \cap \overline{Z} \cap \overline{Y}= \varnothing $ then I don't see how it follows that

$$ X \cap (Y \cup W) \cap \overline{Z} = \varnothing $$

It could be that $X \subset Y \cap W$.


Please excuse me for expressing these in terms of Venn Diagrams. It is always possible to reconstruct the Bayesian Graphical model proof from these.
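To make the suspicion about Intersection concrete: a standard probabilistic counterexample (not specific to this thread) is three identical fair coins, $X = Y = W$. This is a non-positive distribution in which both premises of Intersection hold (given $W$, both $X$ and $Y$ are fully determined, and likewise given $Y$) while the conclusion fails. A quick numerical check, using a brute-force `is_ci` helper of my own:

```python
import numpy as np

def is_ci(joint, A, B, C=()):
    """Brute-force check of (A ⊥ B | C): P(a,b,c)·P(c) == P(a,c)·P(b,c)
    in every cell; A, B, C are disjoint tuples of axes (C may be empty)."""
    other = lambda keep: tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marg = lambda keep: joint.sum(axis=other(keep), keepdims=True)
    return np.allclose(marg(A + B + C) * marg(C), marg(A + C) * marg(B + C))

# X = Y = W, each a fair coin: mass 1/2 on (0,0,0) and on (1,1,1).
# Axes: x=0, y=1, w=2; the conditioning set Z is taken to be empty here.
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = joint[1, 1, 1] = 0.5

assert is_ci(joint, (0,), (1,), (2,))      # X ⊥ Y | W  (X, Y determined by W)
assert is_ci(joint, (0,), (2,), (1,))      # X ⊥ W | Y
assert not is_ci(joint, (0,), (1, 2), ())  # but the conclusion X ⊥ (Y,W) fails
```

This is exactly why the intersection property needs the positivity assumption that the other answers mention.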

Answer (3 votes):

Greetings from Germany!
Funny, it seems we have been struggling with exactly the same problem at exactly the same time. The difference is just that I found these unproven lemmas in Judea Pearl's book "Causality". It seems that different authors copy the same text passages, including the confusing parts. But one part of your question helped me to find out what I got wrong. Here is how to solve it:

The most important insight is that what seems to be a definition of conditional independence is actually a first lemma derived from the real definition. The definition itself is given by:

$(X \perp Y \mid Z) \Leftrightarrow P(x \mid z)P(y \mid z) = P(x,y \mid z)$ with $P(z)>0$.

In another publication by Pearl, he states as a side remark that from now on, it shall be assumed that all conditions have probability greater than 0, so that this way of combining or splitting up probabilities can be used throughout. That is what you have to know to find the proofs! For example, it is then possible to derive what I (and probably you) mistook to be the actual definition of conditional independence:

$(X \perp Y \mid Z)$

$\Leftrightarrow P(x,y \mid z) = P(x \mid z)P(y \mid z)$

$ \Leftrightarrow \frac{ P(x,y \mid z) }{ P(y \mid z) } = P(x \mid z) $

$ \Leftrightarrow \frac{P(x,y,z)P(z)}{P(y,z)P(z)} = P(x \mid z)$

$ \Leftrightarrow \frac{P(x,y,z)}{P(y,z)} = P(x \mid z)$

$ \Leftrightarrow P(x \mid y,z) = P(x \mid z) $

Using the real definition of conditional independence, it is also possible to prove that your interpretation of the mysterious concatenation of capital letters was correct. I had the same problem and made the same guess about the semantics, but I was not sure if one is really allowed to split the expression into two parts. Here is the proof:

$(X \perp Y \mid Z) \& (X \perp W \mid Z)$

$ \Leftrightarrow P(x \mid z)P(y \mid z) = P(x,y\mid z) \ \& \ P(x \mid z)P(w|z) = P(x,w \mid z) $

$ \Leftrightarrow P(x \mid z) = \frac{P(x,y \mid z)}{P(y \mid z)} \ \& \ P(x \mid z)P(w \mid z) = P(x,w \mid z)$

Fitting the left term into the right one (i.e. substituting $P(x \mid z)$ on the RHS with its expression from the LHS), we get:

$ \Leftrightarrow \frac{P(x, y \mid z)P(w\mid z)}{P(y \mid z)} = P(x,w \mid z)$

$ \Leftrightarrow \frac{P(x , y, w \mid z)}{P(y \mid z)} = P(x \mid z)P(w \mid z) $ (note: this and the next steps use identities like $P(x,y \mid z)P(w \mid z) = P(x,y,w \mid z)$ and $P(y \mid z)P(w \mid z) = P(y,w \mid z)$, which are not justified in general; see the remarks on this assumed "transitivity" in my follow-up answer below)

$ \Leftrightarrow \frac{P(x, y, w \mid z)}{P(y \mid z)P(w \mid z)} = P(x \mid z)$

$ \Leftrightarrow \frac{P(x,y,w \mid z)}{P(y,w \mid z)} = P(x \mid z) $

$ \Leftrightarrow \frac{P(x,y,w,z)P(z)}{P(y,w,z) P(z)} = P(x \mid z)$

$ \Leftrightarrow \frac{P(x,y,w,z)}{P(y,w,z)} = P(x \mid z)$

$ \Leftrightarrow P(x \mid y,w,z) = P(x \mid z)$

$ \Leftrightarrow (X \perp Y, W \mid Z)$

And, finally, this also leads to the proof of the weak union feature we both were trying to find at the same time:

Step 1:

$(X \perp Y, W \mid Z)$

$ \Leftrightarrow (X \perp Y \mid Z) \& (X \perp W \mid Z)$

$\implies$ (using the fact that $A \ \& \ B \implies B$):

$(X \perp W \mid Z)$

$ \Leftrightarrow P(x|w,z) = P(x|z)$

Step 2:

$(X \perp Y, W \mid Z)$

$ \Leftrightarrow P(x|y,w,z) = P(x|z)$

Step 3, combining the results of steps 1 and 2:

$P(x\mid z) = P(x \mid y,w,z) = P(x \mid w,z)$

$\Leftrightarrow (X \perp Y \mid W, Z)$

So, in total we derived:

$(X \perp Y, W \mid Z) \implies (X \perp Y \mid W, Z)$ q.e.d.
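As a numerical cross-check of this weak union derivation (my own sketch: the premise is made true by factoring the joint distribution, and the conclusion is then tested brute-force):

```python
import numpy as np

def is_ci(joint, A, B, C):
    """Brute-force (A ⊥ B | C) test: P(a,b,c)·P(c) == P(a,c)·P(b,c) cell-wise.
    A, B, C are disjoint tuples of axes of the joint probability array."""
    other = lambda keep: tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marg = lambda keep: joint.sum(axis=other(keep), keepdims=True)
    return np.allclose(marg(A + B + C) * marg(C), marg(A + C) * marg(B + C))

rng = np.random.default_rng(42)

# Axes: x=0, y=1, w=2, z=3.  The premise (X ⊥ Y,W | Z) holds by construction:
# P(x,y,w,z) = P(z)·P(x|z)·P(y,w|z).
pz = np.array([0.25, 0.75])
px_z = rng.random((2, 2)); px_z /= px_z.sum(axis=0)
pyw_z = rng.random((2, 2, 2)); pyw_z /= pyw_z.sum(axis=(0, 1))
joint = np.einsum('z,xz,ywz->xywz', pz, px_z, pyw_z)

assert is_ci(joint, (0,), (1, 2), (3,))  # premise:    X ⊥ (Y,W) | Z
assert is_ci(joint, (0,), (1,), (3, 2))  # weak union: X ⊥ Y | (Z,W)
```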

Before I realized all this, I had already been able to prove the decomposition feature by assuming the weak union feature. But with this new understanding of what is defined and what is derived, it should be even more straightforward to prove the other features. As soon as I have finished the other proofs, I might post them here, if still needed by someone.

Just one last tip: In Pearl's book, he is citing two earlier articles that can easily be found online and look very informative:

Dawid (1979) Conditional independence in statistical theory

This is the article where this "non-standard notation" of conditional independence was first introduced.

Pearl & Paz (1987) Graphoids: A graph-based logic for reasoning about relevance relations

To my knowledge, this was the first pioneering work on combining graphs with conditional independence to yield graphical probability models. In an article by Chaitin, I recently read that in order to get a feeling for the intuition behind a mathematical theory, it is a good idea to read the very first early publications in which the theory was still under construction, so this might be a good read.

I hope this helps. And if I get stuck again, I now know a new site where to post my questions! :-)

Roul from Bochum & Osnabrück, Germany

Answer (5 votes):

Hi Charlie, first some “non-mathematical” remarks, which you can erase as soon as you read them:

(1) Thanks indeed for explaining the point system of this site. I will give you the points you deserve for your question as soon as you have given me the points I deserve for my extensive answer. ;-)

(2) I have had a look at Firemath and LibreOffice, but I am still reluctant to use MathML etc. just to produce this one special character in the equations. But I might re-edit everything as soon as I have more practice.

(3) As it seems we are currently trying to learn the same theory, making the same progress and asking the same questions at the same time, it might be fruitful to stay in personal contact to discuss about details which might not be of interest to everyone. So feel free to send a message to roul_john (at) web.de if you like. We would still be posting the most interesting questions and results here on this site, to be sure.

In the following, I will recapitulate the proofs of my previous post with some additional remarks, and I will also present the proofs of the other lemmas you asked for.

I made some minor changes compared to your original questions (mostly concerning the full names of the lemmas), following the article Pearl & Paz (1987) "Graphoids: A graph-based logic for reasoning about relevance relations", which was one of the main origins of this mathematical field.

As before, I am going to write (X#Y|Z) to state that X and Y are conditionally independent given Z.


Proof of Decomposition:

To show:

(X#Y,W|Z) => (X#Y|Z) & (X#W|Z)

We start on the right hand side to fuse the two terms into one:

(X#Y|Z) & (X#W|Z)

<=> P(x|z) * P(y|z) = P(x,y|z) & P(x|z) * P(w|z) = P(x,w|z)

<=> P(x|z) = [P(x,y|z) / P(y|z)] & P(x|z) * P(w|z) = P(x,w|z)

<=> (fitting the left term into the right one:)

[P(x,y|z) / P(y|z)] * P(w|z) = P(x,w|z)

<=> [P(x,y|z) * P(w|z)] / P(y|z) = P(x,w|z)

(From here on, I assume a form of transitivity of conditional independence, see remark below.)

<=> P(x,y,w|z) / P(y|z) = P(x|z) * P(w|z)

<=> P(x,y,w|z) / [P(y|z) * P(w|z)] = P(x|z)

<=> P(x,y,w|z) / P(y,w|z) = P(x|z)

<=> [P(x,y,w,z) * P(z)] / [P(y,w,z) * P(z)] = P(x|z)

<=> P(x,y,w,z) / P(y,w,z) = P(x|z)

<=> P(x|y,w,z) = P(x|z)

<=> (X#Y,W|Z)

q.e.d.

By showing the equivalence, and not just the right side following from the left one, we have actually shown more than was asked for.

But one might still ask for a proof of the “form of transitivity of conditional independence” that I assumed. Currently I do not know how to show that:

P(x|z) * P(y|z) = P(x,y|z) & P(x|z) * P(w|z) = P(x,w|z)

=> P(x|z) * P(y|z) * P(w|z) = P(x,y,w|z) & P(y|z) * P(w|z) = P(y,w|z)

I think I am going to post this as a separate question on this site within the next days.
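For what it is worth, this missing "transitivity" is in fact false in general, as a later answer in this thread also points out. A standard counterexample (taking the conditioning set $Z$ to be trivial): let $Y$ and $W$ be independent fair coins and $X = Y \text{ XOR } W$. Then $X$ is independent of each of $Y$ and $W$ separately, but not of the pair. A quick numerical check (the `is_ci` helper is my own brute-force test):

```python
import itertools
import numpy as np

def is_ci(joint, A, B, C=()):
    """Brute-force (A ⊥ B | C) test: P(a,b,c)·P(c) == P(a,c)·P(b,c) cell-wise.
    A, B, C are disjoint tuples of axes (C may be empty)."""
    other = lambda keep: tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marg = lambda keep: joint.sum(axis=other(keep), keepdims=True)
    return np.allclose(marg(A + B + C) * marg(C), marg(A + C) * marg(B + C))

# Axes: x=0, y=1, w=2.  Y, W are independent fair coins and X = Y XOR W,
# so each of the four (y, w) combinations has probability 1/4.
joint = np.zeros((2, 2, 2))
for y, w in itertools.product((0, 1), repeat=2):
    joint[y ^ w, y, w] = 0.25

assert is_ci(joint, (0,), (1,))        # X ⊥ Y
assert is_ci(joint, (0,), (2,))        # X ⊥ W
assert not is_ci(joint, (0,), (1, 2))  # but NOT X ⊥ (Y,W)
```

So pairwise (conditional) independence really does not combine in general, which is why contraction needs its stronger second premise.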


Proof of Weak Closure for Union:

To show:

(X#Y,W|Z) => (X#Y|Z,W)

Step 1:

(X#Y,W|Z)

<=> (X#Y|Z) & (X#W|Z)

=> (using the fact that A & B => B:)

(X#W|Z)

<=> P(x|w,z) = P(x|z)

Step 2:

(X#Y,W|Z)

<=> P(x|y,w,z) = P(x|z)

Step 3, combining the results of steps 1 and 2:

P(x|z) = P(x|y,w,z) = P(x|w,z)

<=> (X#Y|W,Z)

So, in total we derived:

(X#Y,W|Z) => (X#Y|W,Z)

q.e.d.


For the following proofs, the main fact to keep in mind is that the ordering of the arguments of the P function is arbitrary, e.g. P(x,y) = P(X=x & Y=y) = P(Y=y & X=x) = P(y,x)


Proof of Contraction:

To show:

(X#W|Z,Y) & (X#Y|Z) => (X#Y,W|Z)

First we transform the left hand side:

(X#W|Z,Y) & (X#Y|Z)

<=> P(x|w,z,y) = P(x|z,y) & P(x|y,z) = P(x|z)

<=> P(x|w,z,y) = P(x|z,y) = P(x|z)

The term on the right side is equivalent to:

(X#Y,W|Z)

<=> P(x|y,w,z) = P(x|z)

This is precisely the equality between the leftmost and the rightmost terms of the chain just derived.

q.e.d.


Proof of Weak Closure for Intersection:

To show:

(X#Y|Z,W) & (X#W|Z,Y) => (X#Y,W|Z)

We start by decomposing the right hand side, according to the decomposition rule shown above:

(X#Y,W|Z)

<=> (X#Y|Z) & (X#W|Z)

<=> P(x|y,z) = P(x|z) & P(x|w,z) = P(x|z)

<=> P(x|y,z) = P(x|z) = P(x|w,z)

So proving the lemma amounts to showing that P(x|y,z) = P(x|w,z).

The first term on the left side can be transformed to:

(X#Y|Z,W)

<=> P(x|z,w) * P(y|z,w) = P(x,y|z,w)

<=> P(x|z,w) = P(x,y,z,w) / P(y,z,w)

The second term on the left side can be transformed to:

(X#W|Z,Y)

<=> P(x|z,y) * P(w|z,y) = P(x,w|z,y)

<=> P(x|z,y) = P(x,w,z,y) / P(w,z,y)

So we have found two expressions that can be fit into the equality we aim to show:

P(x|y,z) = P(x|w,z)

<=> P(x,w,z,y) / P(w,z,y) = P(x,y,z,w) / P(y,z,w)

And these two sides are indeed identical, because the parameters within the brackets can be sorted in any order.

q.e.d.


One answer to your questions is still missing, namely “what it all means”. This leads somewhat beyond pure mathematics, but after reading half of the original article by Pearl & Paz, I think I will be able to demonstrate that these lemmas are exactly the building blocks needed to construct a network representation of many variables as output, given statements of their individual conditional independences as input. I am still in the process of figuring out how to write this down as a nice pseudocode algorithm, so be a little patient…. :-)

Answer (2 votes):

Intersection

$ (X \perp Y \mid Z,W) \land (X \perp W \mid Z,Y) \Rightarrow (X \perp Y,W \mid Z) $

From the first independence assertion it follows that $P(X \mid Y,W,Z) = P(X \mid Z,W)$, and from the second that $P(X \mid Y,W,Z) = P(X \mid Z,Y)$; so we have $P(X \mid Z,Y) = P(X \mid Z,W)$.

Now we can expand $P(X \mid Z)$ into cases over the values $\omega$ of $W$, provided that $P$ is a positive distribution: $$ P(X \mid Z) = \sum_{\omega}P(X \mid W=\omega, Z)P(W=\omega \mid Z) = P(X \mid Y,Z)\sum_{\omega}P(W=\omega \mid Z) = P(X \mid Y,Z) = P(X \mid Y,W,Z) $$ where the second equality uses that $P(X \mid W=\omega, Z) = P(X \mid Z,Y)$ for every $\omega$, by the identity derived above. $\blacksquare$

Answer (3 votes):

In many of the above solutions, I have found several mistakes. One of them was pointed out by JesterII, namely that $X \perp Y \mid Z$ & $X \perp W \mid Z \Rightarrow X \perp Y,W \mid Z$ is false in general. Another thing I noticed in several places is the use of the equality $ P(X,Y \mid Z)P(W \mid Z) = P(X,Y,W \mid Z)$, which is not true unless $W \perp X,Y \mid Z$. Here is my attempt at the proofs.

  • Decomposition: $ X \perp Y,W \mid Z \Rightarrow X \perp Y \mid Z $
  • Proof: From $X\perp Y,W \mid Z$, we get:

    $$ P(X,Y,W \mid Z) = P(X \mid Z)P(Y,W \mid Z) $$

    Marginalizing W (i.e. summing over all values of W), we get

    $$ \sum_W P(X,Y,W \mid Z) = \sum_W P(X \mid Z)P(Y,W \mid Z) = P(X \mid Z)\sum_W P(Y,W \mid Z)$$

    Thus,

    $$ P(X,Y \mid Z) = P(X \mid Z)P(Y \mid Z) $$

    Hence Proved

  • Contraction: $(X \perp Y \mid Z) \land (X \perp W \mid Y,Z) \Rightarrow X \perp Y,W \mid Z$

  • Proof: By the chain rule:

    $$P(X,Y,W \mid Z) = P(X \mid Y,W,Z)P(Y,W \mid Z)$$

    Using the independence property $ X \perp W \mid Y,Z $, we get:

    $$P(X,Y,W \mid Z) = P(X \mid Y,Z)P(Y,W \mid Z)$$

    Now use the independence property $ X \perp Y \mid Z $ to get:

    $$P(X,Y,W \mid Z) = P(X \mid Z)P(Y,W \mid Z)$$

    Hence Proved

  • Weak Union: $ X \perp Y,W \mid Z \Rightarrow X \perp Y \mid Z,W $

  • Proof: By the chain rule:

    $$P(X,Y \mid W,Z) = P(X \mid Y,W,Z)P(Y \mid W,Z)$$

    Using the independence property $X \perp Y,W \mid Z $, we get:

    $$P(X,Y \mid W,Z) = P(X \mid Z)P(Y \mid W,Z)$$

    Using the decomposition property, we have $X \perp Y,W \mid Z \Rightarrow X \perp W \mid Z$, which gives us $P(X \mid Z) = P(X \mid W,Z)$. Thus

    $$P(X,Y \mid W,Z) = P(X \mid Z)P(Y \mid W,Z) = P(X \mid W,Z)P(Y \mid W,Z)$$

    Hence Proved.

  • Intersection: $(X \perp Y \mid Z,W) \land (X \perp W \mid Y,Z) \Rightarrow X \perp Y,W \mid Z$

  • Proof: Given the 2 independence properties $(X \perp Y \mid Z,W) \land (X \perp W \mid Y,Z)$, we have

    $$P(X \mid Z,W) = P(X \mid Y,Z,W) = P(X \mid Y,Z)$$

    Applying the chain rule to the first and last terms of the above equation, we get:

    $$\frac{P(X,W \mid Z)}{P(W \mid Z)} = \frac{P(X,Y \mid Z)}{P(Y \mid Z)}$$

    Thus,

    $$ P(X,W \mid Z)P(Y \mid Z) = P(X,Y \mid Z)P(W \mid Z) $$

    Summing over W, we get:

    $$ \sum_W P(X,W \mid Z)P(Y \mid Z) = \sum_W P(X,Y \mid Z)P(W \mid Z) $$

    Thus,

    $$ P(X \mid Z)P(Y \mid Z) = P(X,Y \mid Z) $$

    This proves that $X \perp Y \mid Z$

    Given $X \perp Y \mid Z$ and $(X \perp W \mid Y,Z)$, we can now apply the contraction property to get $X \perp Y,W \mid Z$

    Hence Proved
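These proofs can be cross-checked numerically. The sketch below (my own addition; a consistency check, not a proof) makes each set of premises true by factoring a random discrete joint distribution, and then verifies the corresponding conclusion with a brute-force conditional independence test:

```python
import numpy as np

def is_ci(joint, A, B, C):
    """Brute-force (A ⊥ B | C) test for a discrete joint distribution:
    P(a,b,c)·P(c) == P(a,c)·P(b,c) must hold in every cell."""
    other = lambda keep: tuple(ax for ax in range(joint.ndim) if ax not in keep)
    marg = lambda keep: joint.sum(axis=other(keep), keepdims=True)
    return np.allclose(marg(A + B + C) * marg(C), marg(A + C) * marg(B + C))

rng = np.random.default_rng(7)
X, Y, W, Z = (0,), (1,), (2,), (3,)  # axis labels for a 2x2x2x2 joint

# Joint 1: P = P(z)·P(x|z)·P(y,w|z), so (X ⊥ Y,W | Z) holds by construction.
# Its entries are strictly positive (w.p. 1), so intersection applies too.
pz = np.array([0.4, 0.6])
px_z = rng.random((2, 2)); px_z /= px_z.sum(axis=0)
pyw_z = rng.random((2, 2, 2)); pyw_z /= pyw_z.sum(axis=(0, 1))
j1 = np.einsum('z,xz,ywz->xywz', pz, px_z, pyw_z)

assert is_ci(j1, X, Y + W, Z)  # premise: X ⊥ (Y,W) | Z
assert is_ci(j1, X, Y, Z)      # decomposition
assert is_ci(j1, X, Y, Z + W)  # weak union (also an intersection premise)
assert is_ci(j1, X, W, Z + Y)  # the other intersection premise
# intersection's conclusion X ⊥ (Y,W) | Z is the first assertion above

# Joint 2: P = P(z)·P(x|z)·P(y|z)·P(w|y,z), so both contraction premises hold.
py_z = rng.random((2, 2)); py_z /= py_z.sum(axis=0)
pw_yz = rng.random((2, 2, 2)); pw_yz /= pw_yz.sum(axis=0)
j2 = np.einsum('z,xz,yz,wyz->xywz', pz, px_z, py_z, pw_yz)

assert is_ci(j2, X, W, Z + Y)  # premise: X ⊥ W | Z,Y
assert is_ci(j2, X, Y, Z)      # premise: X ⊥ Y | Z
assert is_ci(j2, X, Y + W, Z)  # contraction's conclusion
```

Note that for intersection this only checks consistency: any positive joint satisfying both premises must, by the very property being proved, already have the factored form used here.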

Answer (0 votes):

was wondering what the decomposition, weak union and contraction properties of conditional probability mean.

For those discovering this thread in their quest for intuitive meanings of these axioms, I would recommend the opening of Chapter 3 in Pearl's Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, and particularly Figure 3.1:

(Figure 3.1: graphical interpretation of the axioms governing conditional independence)

View set $Z$ as "informationally separating" $X$ from $Y$. Then Decomposition expresses the idea that if $Z$ separates $X$ from the compound set $S = Y \cup W$ then it separates $X$ from every subset of $S$. Or, in other words, if two combined items of information are judged irrelevant to $X$, then each item is irrelevant to $X$ separately as well.

The Weak Union axiom states that if $Y$ is irrelevant to $X$ then finding some additional item $W$ from the same space of those irrelevant to $X$ (separated by $Z$) and augmenting the separator with it will not render $Y$ relevant to $X$ as a result.

And so on: Contraction says that if $W$ is judged irrelevant to $X$ after we learn the (itself irrelevant) information $Y$, then $Y$ and $W$ together must have been irrelevant to $X$ from the start, which lets us shrink the separating set from $Z \cup Y$ back to $Z$.