The task is exercise 33.7 from the book Information Theory, Inference, and Learning Algorithms.
The question asks us to find three distinct minima obtained by minimizing the reversed KL divergence $KL(q\|p)$.
The gist of the question is as follows. We are given the joint distribution $p(x,y)$, where rows represent $y$ and columns represent $x$:
$$ \begin{array}{l|llll} & 1 & 2 & 3 & 4 \\ \hline 1 & 1 / 8 & 1 / 8 & 0 & 0 \\ 2 & 1 / 8 & 1 / 8 & 0 & 0 \\ 3 & 0 & 0 & 1 / 4 & 0 \\ 4 & 0 & 0 & 0 & 1 / 4 \end{array} $$ If we approximate $p$ with $q$ by minimizing $KL(q\|p)$:
1) show that there are three distinct minima
2) what is the value of $KL(q||p)$ if we set $q(x,y) = p(x)p(y)$?
We may assume that $q$ factorizes, i.e. $q(x,y) = q(x)q(y)$.
My progress so far: $$ \begin{array}{l} \mathrm{KL}(q(x, y) \| p(x, y))=\sum_{x, y} q(x, y) \log \frac{q(x, y)}{p(x, y)}=\sum_{x, y} q(x) q(y) \log \frac{q(x) q(y)}{p(x, y)} \\ =\sum_{x} q(x) \log q(x)+\sum_{y} q(y) \log q(y)-\sum_{x, y} q(x) q(y) \log p(x, y) \end{array} $$
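(As a quick numerical sanity check of this decomposition, here is a short NumPy sketch; the particular values of $q(x)$ and $q(y)$ are arbitrary illustrative choices, not part of the exercise.)

```python
import numpy as np

# Joint distribution p(x, y) from the exercise (rows index y, columns index x).
p = np.array([[1/8, 1/8, 0, 0],
              [1/8, 1/8, 0, 0],
              [0,   0, 1/4, 0],
              [0,   0, 0, 1/4]])

# An arbitrary factorized q supported where p > 0 (illustrative values only).
qx = np.array([0.3, 0.7, 0.0, 0.0])
qy = np.array([0.6, 0.4, 0.0, 0.0])
q = np.outer(qy, qx)  # q(x, y) = q(x) q(y); rows index y, columns index x

# Direct KL(q || p), restricted to the support of q (0 log 0 = 0 convention).
mask = q > 0
kl_direct = np.sum(q[mask] * np.log(q[mask] / p[mask]))

# Decomposed form: sum_x q(x) log q(x) + sum_y q(y) log q(y)
#                  - sum_{x,y} q(x) q(y) log p(x, y)
ent_terms = sum(v * np.log(v) for v in qx if v > 0) \
          + sum(v * np.log(v) for v in qy if v > 0)
cross = np.sum(q[mask] * np.log(p[mask]))
kl_decomposed = ent_terms - cross

print(np.isclose(kl_direct, kl_decomposed))  # the two expressions agree
```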
Setting derivative wrt $q(x)$ to zero: $$ \begin{array}{l} \frac{\partial}{\partial q(x)} \operatorname{KL}(q(x, y) \| p(x, y))=0 \iff \\ \frac{\partial}{\partial q(x)} \sum_{x} q(x) \log q(x)+\frac{\partial}{\partial q(x)} \sum_{y} q(y) \log q(y)=\frac{\partial}{\partial q(x)}\sum_{x} q(x) \sum_{y} q(y) \log p(x, y) \end{array} $$
However, I am not sure how to compute these derivatives, which I suspect must be done to proceed, and then similarly with respect to $q(y)$. I am not very familiar with variational calculus. Any help is appreciated.
I'm going to change notation a bit - instead of writing $q(x)$ and $q(y),$ I'll write $q_X(x), q_Y(y).$ Similarly, instead of $q(x,y),$ I'll write $q_{XY}(x,y)$.
Instead of going straight through to the Lagrangians, let us stare at the structure of the question a bit.
First, note that since $q_{XY}({x,y})$ is a product distribution, if $q_X(x) > 0$ and $q_Y(y) > 0,$ then $q_{XY}({x,y}) = q_X(x) q_Y(y) > 0.$ Equivalently, if $q_X$ has support $S_X$ and $q_Y$ has support $S_Y$, then $q_{XY}$ has support $S_X \times S_Y$.
Next, observe that if there is a pair $(x,y)$ such that $p(x,y) = 0$ but $q_{XY}(x,y) > 0$, then $D(q\|p) = \infty$. In our problem, this means that the minimising $q_{XY}$ must be positive only on $(\{1,2\} \times \{1,2\}) \cup \{(3,3)\} \cup \{(4,4)\}$.
So, our candidate minimising $q_X$ and $q_Y$ must have supports such that $$S_X \times S_Y \subseteq (\{1,2\} \times \{1,2\}) \cup \{(3,3)\} \cup \{(4,4)\}.$$
This directly leaves us with three cases:

1) $S_X \subseteq \{1,2\}$ and $S_Y \subseteq \{1,2\}$;

2) $S_X = S_Y = \{3\}$;

3) $S_X = S_Y = \{4\}$.

(No mixing across cases is possible: if, say, $3 \in S_X$ and $S_Y$ contained anything other than $3$, the product support $S_X \times S_Y$ would hit a zero of $p$.)
Cases 2 and 3 are simple - these consist of only one distribution each, which I'll call $q^3$ and $q^4$, which are $q^3_{X} = q^3_Y = (0,0,1,0)$ and $q^4_{X} = q^4_{Y} = (0,0,0,1)$. Note that these are both local minima - if we increase $q^3(x,y)$ for any $(x,y) \neq (3,3)$ then the divergence becomes $\infty$, so all distributions close to $q^3$ (close defined according to, say, the usual $\ell_2$ distance) have larger divergence. Similarly for $q^4$. Finally, note that $$D(q^3\|p) = D(q^4\|p) = \log(4).$$
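These values can be confirmed numerically; the helper `kl_product` below is my own sketch (not from the book), and it also illustrates the earlier point that putting any mass where $p$ vanishes blows the divergence up:

```python
import numpy as np

# Joint distribution from the exercise (rows index y, columns index x).
p = np.array([[1/8, 1/8, 0, 0],
              [1/8, 1/8, 0, 0],
              [0,   0, 1/4, 0],
              [0,   0, 0, 1/4]])

def kl_product(qx, qy, p):
    """KL(q || p) for q(x, y) = q(x) q(y), with the 0 log 0 = 0 convention."""
    q = np.outer(qy, qx)            # rows index y, columns index x
    mask = q > 0
    if np.any(p[mask] == 0):        # q positive where p is zero => infinite divergence
        return float('inf')
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q3 = np.array([0., 0., 1., 0.])     # point mass on 3
q4 = np.array([0., 0., 0., 1.])     # point mass on 4
print(kl_product(q3, q3, p))        # log(4)
print(kl_product(q4, q4, p))        # log(4)

# Mixing 3 and 4 puts mass on (3, 4) and (4, 3), where p = 0:
print(kl_product(np.array([0., 0., .5, .5]),
                 np.array([0., 0., .5, .5]), p))  # inf
```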
This leaves case $1$. Let $q_X$ and $q_Y$ each be supported on $\{1,2\}$. Note that \begin{align} D(q\|p) &= \sum_{x= 1}^2 \sum_{y = 1}^2 q_X(x)q_Y(y) \log \frac{q_X(x) q_Y(y)}{{p(x,y)}} \\ &= \sum_{x= 1}^2 \sum_{y = 1}^2 q_X(x)q_Y(y) \log {q_X(x) q_Y(y)} + \log(8) \\ &= \log(8) - H(q_X) - H(q_Y) .\end{align}
Notice immediately that if we take $q_X$ and $q_Y$ to be the max-entropy distributions satisfying the given constraints, then we'll minimise the above. The constraint is just that both are supported on $\{1,2\}.$ So, the minimisers here are $q_X = q_Y = (1/2, 1/2, 0,0)$, and we attain the value $\log(8) - 2\log(2) = \log(2).$ Importantly, this case is convex, so the minimum above is unique.
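One can also check the case-1 minimum numerically with a crude grid search over the free parameters $a = q_X(1)$ and $b = q_Y(1)$ (again just an illustrative sketch, reusing the `kl_product` helper defined above in spirit):

```python
import numpy as np

# Joint distribution from the exercise (rows index y, columns index x).
p = np.array([[1/8, 1/8, 0, 0],
              [1/8, 1/8, 0, 0],
              [0,   0, 1/4, 0],
              [0,   0, 0, 1/4]])

def kl_product(qx, qy, p):
    """KL(q || p) for q(x, y) = q(x) q(y), with the 0 log 0 = 0 convention."""
    q = np.outer(qy, qx)
    mask = q > 0
    if np.any(p[mask] == 0):
        return float('inf')
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Case 1: q_X = (a, 1-a, 0, 0), q_Y = (b, 1-b, 0, 0); scan a grid of (a, b).
grid = np.linspace(0.01, 0.99, 99)
vals = np.array([[kl_product(np.array([a, 1 - a, 0, 0]),
                             np.array([b, 1 - b, 0, 0]), p)
                  for a in grid] for b in grid])
i, j = np.unravel_index(vals.argmin(), vals.shape)
print(grid[j], grid[i], vals[i, j])  # minimum at a = b = 1/2, value log(2)
```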
So, we have found exactly three distinct local minima, of which the case 1 minimum is the unique global minimum.
Some comments on the line of attack you took, and comments you made, in the question: