Confused by Kullback-Leibler on conditional probability distributions


I understand the Kullback-Leibler divergence well enough when it comes to a probability distribution over a single variable. However, I'm currently trying to teach myself variational methods and the use of the KL divergence in conditional probabilities is catching me out. The source I'm working from is here.

Specifically, the author represents the KL divergence as follows:

$$KL(Q_\phi(Z\mid X)\,\|\,P(Z\mid X)) = \sum_{z\in Z} q_\phi(z\mid x) \log\frac{q_\phi(z\mid x)}{p(z\mid x)}$$

Where the confusion arises is on the summation across $Z$. Given that $z \in Z$ and $x \in X$, I would have expected (by analogy with conditional entropy) a double sum here of the form:

$$KL(Q_\phi(Z\mid X)\,\|\,P(Z\mid X)) = \sum_{z\in Z} \sum_{x\in X} q_\phi(z\mid x) \log\frac{q_\phi(z\mid x)}{p(z\mid x)}$$

Otherwise, it seems to me that KL is only being calculated for one sample from $X$. Am I missing something basic here? And if my intuitions are off, any tips on getting them back on track would be useful––I'm teaching myself this stuff, so I don't have the benefit of formal instruction.


Accepted answer:

It depends on whether you are conditioning on a random variable or an event.

Given a random variable $x$,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] \doteq \iint p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{x} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{x}}\sum_{\bar{y}} p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$

Given an event $\bar{x}$,

$$ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \doteq \int p(\bar{y}|\bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{y}} p(\bar{y}|\bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$

Note how conditioning on an event is equivalent to changing the probability distribution over its variable to a point mass. This is what turns the joint into a conditional above,

$$ p'(x,y) \doteq p(y|x)\delta_{\bar{x}}(x)=p(y|\bar{x}). $$

To be more explicit: instead of the KL conditioned on a random variable, you can equivalently take an expectation, over events, of the KL conditioned on each event,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] =\operatorname{E}_{\bar{x}\sim p(x)}\big[ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \big]. $$
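This identity is easy to verify numerically on a small discrete example. The following is a minimal NumPy sketch (the distributions here are arbitrary, generated just for illustration): it computes the double sum weighted by the joint $p(x,y)$ and compares it with the expectation over $x$ of the per-event KL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete example: joint p(x, y) over 3 x-values and 4 y-values.
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]      # conditional p(y | x), rows sum to 1

# An arbitrary second conditional q(y | x).
q_y_given_x = rng.random((3, 4))
q_y_given_x /= q_y_given_x.sum(axis=1, keepdims=True)

# KL conditioned on the random variable x: double sum over x and y,
# weighted by the joint p(x, y).
kl_joint = np.sum(p_xy * np.log(p_y_given_x / q_y_given_x))

# Same quantity as E_{x ~ p(x)} [ KL[p(y | x) || q(y | x)] ].
kl_per_event = np.sum(p_y_given_x * np.log(p_y_given_x / q_y_given_x), axis=1)
kl_expect = np.sum(p_x * kl_per_event)

assert np.isclose(kl_joint, kl_expect)
```

The two computations agree because weighting the per-event KL by $p(x)$ reconstructs exactly the joint weight $p(x,y) = p(x)\,p(y\mid x)$ in the double sum.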

Mixing up random variables and events is quite common, but it's usually easy to tell from context which is meant.

Another answer:

I don't quite see what confuses you. Think about how we compute, for example, a conditional expectation: $E(Z \mid X)=\sum_{z} z\, P(z \mid X)$. We sum only over $Z$, and the result is a function of the conditioning variable $X$. (Put another way, for each value of $X$, $P(Z \mid X=x)$ is a different probability distribution, and hence for each value of $X$ we get different values of the expectation, variance, etc., conditioned on $X=x$.) The same happens here: the conditional KL divergence is not a number, but a function of $X$.
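To make the "function of $X$" point concrete, here is a minimal NumPy sketch (the distributions are made up for illustration). Summing over $y$ alone leaves one KL value per value of $x$, i.e. an array indexed by $x$ rather than a single scalar:

```python
import numpy as np

# Hypothetical conditionals over y, one row per value of x.
p = np.array([[0.7, 0.3],
              [0.5, 0.5],
              [0.1, 0.9]])   # p(y | x)
q = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.2, 0.8]])   # q(y | x)

# Sum over y only: one divergence per value of x.
kl_of_x = np.sum(p * np.log(p / q), axis=1)   # shape (3,)
```

Note that `kl_of_x[1]` is exactly zero, since the second rows of `p` and `q` coincide; the conditional KL vanishes for that particular value of $x$ while remaining positive for the others.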