How to understand the product of two conditional probabilities?


I am struggling a bit to make sense of the distribution of bigrams in an artificial language I randomly generated from English.

Every word occurs with equal probability, but the syllables and phones that make up a word have a specific distribution, such that the transitional probabilities within a word are much higher than those between words.

I am trying to show that the information entropy of a word (log2(1/#words)) is the sum of the IE of its individual components. I tried working it out on my own and got stuck; maybe someone could help get my gears going again?

So I am aware of the product rule:

p(A,B) = p(A|B) * p(B)

where A and B are independent events if p(A|B) = p(A,B)/p(B) = p(A).

If A and B are independent events, then p(A,B) = p(A) * p(B).
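A quick numeric sanity check of both facts, using a made-up 2×2 joint distribution (the numbers are purely illustrative, chosen so the joint happens to factor):

```python
# Hypothetical joint distribution over A in {0,1}, B in {0,1}.
p_joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.2}

def p_A(a):
    # Marginal p(A=a) by summing out B.
    return sum(q for (x, y), q in p_joint.items() if x == a)

def p_B(b):
    # Marginal p(B=b) by summing out A.
    return sum(q for (x, y), q in p_joint.items() if y == b)

def p_A_given_B(a, b):
    # Conditional p(A=a | B=b) = p(A,B) / p(B).
    return p_joint[(a, b)] / p_B(b)

# Product rule: p(A,B) = p(A|B) * p(B) holds for ANY joint distribution.
for (a, b), q in p_joint.items():
    assert abs(q - p_A_given_B(a, b) * p_B(b)) < 1e-12

# This particular joint also factors, so A and B are independent:
# p(A|B) = p(A), hence p(A,B) = p(A) * p(B).
for (a, b), q in p_joint.items():
    assert abs(q - p_A(a) * p_B(b)) < 1e-12
```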

However what does it mean when you multiply two separate conditional probabilities in this fashion:

p(B|A) * p(C|B)?

Does this simplify to p(A, B, C)?

Also does this generalize when we write the information entropy of A, B and C?

I.e., H(A,B,C) = H(B|A) + H(C|B)

Where H(x) = log2(1/p(x))

Note that I am interested in the case where A, B and C occur in that exact order.

BEST ANSWER

$p(B|A) \, p(C|B)$? Does this simplify to $p(A, B, C)$?

No. Why would it? What you can write (for example) is

$$P(A,B,C)=P(C\mid B,A) \, P(B,A) = P(C\mid B,A) \, P(B \mid A) \, P(A)$$
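This can be checked numerically. A minimal sketch with a made-up joint distribution over three binary variables (all numbers are illustrative): the full chain rule recovers the joint exactly, while $p(B\mid A)\,p(C\mid B)$ by itself does not.

```python
import itertools

# Hypothetical joint over three binary variables A, B, C (made-up numbers, sum to 1).
probs = [0.10, 0.05, 0.15, 0.20, 0.08, 0.12, 0.22, 0.08]
p = dict(zip(itertools.product([0, 1], repeat=3), probs))

def marg(**fix):
    # Marginal of the named variables, e.g. marg(a=1, b=0).
    idx = {"a": 0, "b": 1, "c": 2}
    return sum(q for abc, q in p.items()
               if all(abc[idx[k]] == v for k, v in fix.items()))

a, b, c = 1, 0, 1
p_C_given_B = marg(b=b, c=c) / marg(b=b)
p_B_given_A = marg(a=a, b=b) / marg(a=a)
p_C_given_BA = p[(a, b, c)] / marg(a=a, b=b)

# The full chain rule P(C|B,A) * P(B|A) * P(A) recovers the joint exactly ...
chain = p_C_given_BA * p_B_given_A * marg(a=a)
assert abs(chain - p[(a, b, c)]) < 1e-12

# ... but p(B|A) * p(C|B) alone differs from the joint for this distribution.
assert abs(p_B_given_A * p_C_given_B - p[(a, b, c)]) > 0.05
```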

In the particular case in which $A \to B \to C$ form a Markov chain (the probability of the "present" conditioned on the full "past" depends only on the most recent past), which you don't state but seem to suggest with your remark about the order of occurrence, $P(C \mid B,A)$ simplifies to $P(C\mid B)$ and you get the familiar formula

$$P(A,B,C) = P(C \mid B) \, P(B \mid A) \, P(A)$$

which can be generalized (always under the Markovian assumption) to more variables. Say


$$P(X_1,X_2 \cdots X_n) = P(X_n \mid X_{n-1}) P(X_{n-1} \mid X_{n-2}) \cdots P(X_2 \mid X_1) \, P(X_1)$$

A similar formula ("chain rule") works for entropies:


$$H(A,B,C) = H(C \mid B,A) + H(B \mid A) + H(A)$$

and, under the Markovian assumption:


$$H(A,B,C) = H(C \mid B) + H(B\mid A) + H(A)$$
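Both Markovian formulas can be verified numerically. A minimal sketch, assuming a hypothetical two-state chain with a made-up initial distribution and transition matrix (both invented for illustration):

```python
import itertools
from math import log2

# Hypothetical two-state Markov chain A -> B -> C: a made-up initial
# distribution and one transition matrix shared by both steps.
p_init = [0.7, 0.3]                     # p(A)
T = [[0.9, 0.1],                        # T[x][y] = p(B=y | A=x) = p(C=y | B=x)
     [0.2, 0.8]]

def p(a, b, c):
    # Markov factorization: p(A,B,C) = p(A) * p(B|A) * p(C|B).
    return p_init[a] * T[a][b] * T[b][c]

states = [0, 1]
triples = list(itertools.product(states, repeat=3))

# The factorization defines a valid joint distribution (it sums to 1).
assert abs(sum(p(*t) for t in triples) - 1.0) < 1e-12

def H_row(row):
    # Entropy of one distribution, in bits.
    return -sum(q * log2(q) for q in row if q > 0)

H_A = H_row(p_init)
H_B_given_A = sum(p_init[a] * H_row(T[a]) for a in states)   # H(B|A)
p_B = [sum(p_init[a] * T[a][b] for a in states) for b in states]
H_C_given_B = sum(p_B[b] * H_row(T[b]) for b in states)      # H(C|B)

H_ABC = -sum(p(*t) * log2(p(*t)) for t in triples)           # joint entropy

# Entropy chain rule under the Markovian assumption:
# H(A,B,C) = H(C|B) + H(B|A) + H(A).
assert abs(H_ABC - (H_A + H_B_given_A + H_C_given_B)) < 1e-10
```

Without the Markov assumption, the `H_C_given_B` term would have to be replaced by the full conditional entropy $H(C \mid B,A)$, as in the general chain rule above.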