Find minima for the KL divergence


The task is exercise 33.7 from David MacKay's book Information Theory, Inference, and Learning Algorithms.

The question concerns finding three distinct minima when minimizing the reverse KL divergence $KL(q||p)$.

The gist of the question is as follows. We are given the joint distribution $p(x,y)$, where the rows represent $y$ and the columns represent $x$:

$$ \begin{array}{l|llll} & 1 & 2 & 3 & 4 \\ \hline 1 & 1 / 8 & 1 / 8 & 0 & 0 \\ 2 & 1 / 8 & 1 / 8 & 0 & 0 \\ 3 & 0 & 0 & 1 / 4 & 0 \\ 4 & 0 & 0 & 0 & 1 / 4 \end{array} $$

If we approximate $p$ with $q$ by minimizing $KL(q||p)$:

1) Show that there are three distinct minima.

2) What is the value of $KL(q||p)$ if we set $q(x,y) = p(x)p(y)$?

We may assume that $q$ factorizes: $q(x,y) = q(x)q(y)$.

My progress so far: $$ \begin{array}{l} \mathrm{KL}(q(x, y) \| p(x, y))=\sum_{x, y} q(x, y) \log \frac{q(x, y)}{p(x, y)}=\sum_{x, y} q(x) q(y) \log \frac{q(x) q(y)}{p(x, y)} \\ =\sum_{x} q(x) \log q(x)+\sum_{y} q(y) \log q(y)-\sum_{x, y} q(x) q(y) \log p(x, y) \end{array} $$

Setting the derivative with respect to $q(x)$ to zero: $$ \begin{array}{l} \frac{\partial}{\partial q(x)} \operatorname{KL}(q(x, y) \| p(x, y))=0 \iff \\ \frac{\partial}{\partial q(x)} \sum_{x} q(x) \log q(x)+\frac{\partial}{\partial q(x)} \sum_{y} q(y) \log q(y)=\frac{\partial}{\partial q(x)}\sum_{x} q(x) \sum_{y} q(y) \log p(x, y) \end{array} $$

However, I am not sure how to compute these derivatives, which I suspect must be solved to proceed (and then similarly with respect to $q(y)$ afterwards). I am not very familiar with variational calculus. Any help is appreciated.

Best answer:

I'm going to change notation a bit - instead of writing $q(x)$ and $q(y),$ I'll write $q_X(x), q_Y(y).$ Similarly, instead of $q(x,y),$ I'll write $q_{XY}(x,y)$.

Instead of going straight through to the Lagrangians, let us stare at the structure of the question a bit.

First, note that since $q_{XY}({x,y})$ is a product distribution, if $q_X(x) > 0$ and $q_Y(y) > 0,$ then $q_{XY}({x,y}) = q_X(x) q_Y(y) > 0.$ Equivalently, if $q_X$ has support $S_X$ and $q_Y$ has support $S_Y$, then $q_{XY}$ has support $S_X \times S_Y$.

Next, observe that if there is a pair $x,y$ such that $p(x,y) = 0$ but $q_{XY}(x,y) > 0$ then $D(q\|p) = \infty$. In our problem, this means that the minimising $q_{XY}$ must be positive only on $(\{1,2\} \times \{1,2\}) \cup \{(3,3)\} \cup \{(4,4)\}.$

So, our candidate minimising $q_X$ and $q_Y$ must have supports such that $$S_X \times S_Y \subset (\{1,2\} \times \{1,2\}) \cup \{(3,3)\} \cup \{(4,4)\}.$$

This directly leaves us with three cases:

  1. $S_X \subset \{1,2\}$ and $S_Y \subset \{1,2\}$.
  2. $S_X = S_Y = \{3\}.$
  3. $S_X = S_Y = \{4\}.$

Cases 2 and 3 are simple - these consist of only one distribution each, which I'll call $q^3$ and $q^4$: $q^3_{X} = q^3_Y = (0,0,1,0)$ and $q^4_{X} = q^4_{Y} = (0,0,0,1)$. Note that these are both local minima - if we increase $q^3_{XY}(x,y)$ for any $(x,y) \neq (3,3)$ then the divergence becomes $\infty$, so all distributions close to $q^3$ (closeness measured in, say, the usual $\ell_2$ distance) have larger divergence. Similarly for $q^4$. Finally, note that $$D(q^3\|p) = D(q^4\|p) = \log(4).$$
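As a quick numerical sanity check, here is a minimal numpy sketch (the matrix `p` and the point masses $q^3, q^4$ are exactly as defined above; `kl` uses the convention $0 \log 0 = 0$):

```python
import numpy as np

# Joint distribution p(x, y); rows are y, columns are x, as in the question.
p = np.array([[1/8, 1/8, 0,   0  ],
              [1/8, 1/8, 0,   0  ],
              [0,   0,   1/4, 0  ],
              [0,   0,   0,   1/4]])

def kl(q, p):
    """D(q || p), with 0 log(0/.) = 0 and inf where q > 0 but p = 0."""
    mask = q > 0
    if np.any(p[mask] == 0):
        return np.inf  # q puts mass where p has none
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q3_x = np.array([0., 0., 1., 0.])  # q^3_X = q^3_Y
q4_x = np.array([0., 0., 0., 1.])  # q^4_X = q^4_Y
print(kl(np.outer(q3_x, q3_x), p), np.log(4))  # both ≈ 1.3863
print(kl(np.outer(q4_x, q4_x), p), np.log(4))  # both ≈ 1.3863
```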

This leaves case $1$. Let $q_X$ and $q_Y$ each be supported on $\{1,2\}$. Note that \begin{align} D(q\|p) &= \sum_{x= 1}^2 \sum_{y = 1}^2 q_X(x)q_Y(y) \log \frac{q_X(x) q_Y(y)}{{p(x,y)}} \\ &= \sum_{x= 1}^2 \sum_{y = 1}^2 q_X(x)q_Y(y) \log {q_X(x) q_Y(y)} + \log(8) \\ &= \log(8) - H(q_X) - H(q_Y) .\end{align}

Notice immediately that if we take $q_X$ and $q_Y$ to be the maximum-entropy distributions satisfying the given constraint, then we minimise the above. The constraint is just that both are supported on $\{1,2\}$, so the minimisers here are $q_X = q_Y = (1/2, 1/2, 0,0)$, and we attain the value $\log(8) - 2\log(2) = \log(2).$ Importantly, the objective in this case is convex (negative entropy is convex), so this minimum is unique.
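Continuing the sketch above, a one-parameter scan over case-1 product distributions (taking $q_X = q_Y = (a, 1-a, 0, 0)$, which suffices by symmetry) confirms the minimum at $a = 1/2$:

```python
# Case 1: q_X = q_Y = (a, 1-a, 0, 0); the divergence is log(8) - 2 H(a).
for a in (0.1, 0.3, 0.5, 0.7, 0.9):
    qx = np.array([a, 1 - a, 0, 0])
    print(a, kl(np.outer(qx, qx), p))
# Smallest value at a = 0.5: log(2) ≈ 0.6931.
```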

So, we have found exactly three distinct local minima, of which the case 1 minimum is the unique global minimum.
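One more check makes the word "distinct" vivid: any product distribution strictly between the case-1 minimiser and $q^3$ puts mass where $p$ has none, so the three minima are separated by regions of infinite divergence (again continuing the numpy sketch above):

```python
# Interpolate q_X = q_Y between the case-1 minimiser and q^3: the divergence
# is infinite along the whole open segment, so the minima are isolated.
q_star = np.array([0.5, 0.5, 0, 0])
for t in (0.25, 0.5, 0.75):
    qx = (1 - t) * q_star + t * q3_x
    print(t, kl(np.outer(qx, qx), p))  # inf at every interior t
```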


Some comments on the line of attack you took and on the remarks you made in the question:

  1. This is not a variational calculus problem - you're looking for two vectors in the probability simplex over four outcomes. This is very much in the realm of standard vector calculus: each $q_X(x), q_Y(y)$ is a variable, and you can compute partial derivatives as you normally would.
  2. The optimisation problem is constrained - you must have $\sum_x q_X(x) = 1$, similarly for $Y$. These constraints must be incorporated into your Lagrangian (you must also, btw, have $q_X(x) \ge 0$ and $q_Y(y) \ge 0$).
  3. The usual constrained optimisation approach is a little iffy, since the objective involves a product of decision variables via the term $\sum q_X(x) q_Y(y) \log(1/p(x,y))$, rendering the problem non-convex. In principle, this also has to occur in the study of case 1 in my argument above, but this is made easy by the fact that $p(x,y)$ is a constant in $\{1,2\} \times \{1,2\},$ at which point I can use that both $q_X$ and $q_Y$ are distributions supported on $\{1,2\}$ to simply evaluate this term to $\log(8)$.
  4. Finally, if you had done part 2 first, then the line of attack I developed above would have been more obvious. Basically, the marginals $p_X(x), p_Y(y)$ have full support, so using $q(x,y) = p_X(x) p_Y(y)$ yields infinite divergence (verified numerically below). This hints at the fact that the minimisers we are interested in must have restricted supports.
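To close the loop on part 2 (continuing the numpy sketch from earlier), plugging in the product of the marginals does give infinite divergence, exactly as argued in comment 4:

```python
# Part 2: q(x, y) = p_X(x) p_Y(y) built from the marginals of p.
p_y = p.sum(axis=1)   # marginal of y (rows); (1/4, 1/4, 1/4, 1/4)
p_x = p.sum(axis=0)   # marginal of x (columns); also uniform
print(kl(np.outer(p_y, p_x), p))  # inf: q > 0 at e.g. (x, y) = (3, 1) where p = 0
```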