I am trying to solve a variational inference problem, and I need help minimising the Kullback-Leibler divergence (KLD) between two joint distributions. This problem is related to my previous question.
I have a joint distribution \begin{align} p(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \big( \alpha_{k} \, p_{k} (\mathbf{x}) \big)^{z_{k}}, \end{align} where $p_{k}(\mathbf{x}) = \mathcal{N} (\mathbf{x}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})$ is a Gaussian distribution with mean $\boldsymbol{\mu}_{k}$ and covariance $\boldsymbol{\Sigma}_{k}$, and $\alpha_{k}$ is a weight that must satisfy $0 < \alpha_{k} \leq 1$ together with $\sum_{k=1}^{K} \alpha_{k} = 1$. Here $\mathbf{z} = \begin{bmatrix} z_{1} & z_{2} & \cdots & z_{K} \end{bmatrix}^{T}$ is a $1$-of-$K$ binary variable, where $z_{k} \in \{ 0, 1 \}$ such that $\sum_{k=1}^{K} z_{k} = 1$. That is to say, one element of $\mathbf{z}$ must equal one, while the rest must equal zero. We now introduce a similarly defined joint distribution \begin{align} q(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \big( \beta_{k} \, q(\mathbf{x}) \big)^{z_{k}}, \end{align} where $q(\mathbf{x}) = \mathcal{N} (\mathbf{x}; \boldsymbol{\nu}, \mathbf{T})$, and $\beta_{k} \in (0, 1]$ such that $\sum_{k=1}^{K} \beta_{k} = 1$. I want to find the optimal parameters $\boldsymbol{\nu}$, $\mathbf{T}$ and $\{ \beta_{k} \}_{k=1}^{K}$ that minimise the KLD \begin{align} D_{\text{KL}} (q (\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) ) &= \sum_{\mathbf{z}} \int q(\mathbf{x}, \mathbf{z}) \ln \Big( \frac{ q(\mathbf{x}, \mathbf{z}) }{ p(\mathbf{x}, \mathbf{z}) } \Big) d \mathbf{x}. 
\end{align} I believe the KLD can be simplified as follows: \begin{align} D_{\text{KL}} (q (\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) ) = \int q(\mathbf{x}) \ln \Big( \frac{q(\mathbf{x})}{f(\mathbf{x})} \Big) d \mathbf{x} + \sum_{k=1}^{K} \beta_{k} \ln \Big( \frac{\beta_{k}}{\alpha_{k}} \Big) - \ln(Z), \tag{1} \end{align} where we define the probability distribution \begin{align} f(\mathbf{x}) = \frac{1}{Z} \prod_{k=1}^{K} \big( p_{k} (\mathbf{x}) \big)^{\beta_{k}} \end{align} with the normalising constant \begin{align} Z = \int \prod_{k=1}^{K} \big( p_{k} (\mathbf{x}) \big)^{\beta_{k}} d \mathbf{x}. \end{align} Unfortunately, I don't know how to proceed. How do I find these optimal parameters? (It is tempting to think the KLD is minimised when $q(\mathbf{x}) = f(\mathbf{x})$ and $\beta_{k} = \alpha_{k}$ for all $k$, but I don't think this is true: for fixed $\{ \beta_{k} \}$ the first term is indeed minimised by $q(\mathbf{x}) = f(\mathbf{x})$, which is itself Gaussian, but $Z$ depends on $\{ \beta_{k} \}$, so the weights cannot be optimised independently of the $-\ln(Z)$ term.)
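For concreteness, $f(\mathbf{x})$ has closed-form parameters: collecting the exponents of the Gaussian powers gives the combined precision $\boldsymbol{\Lambda} = \sum_{k} \beta_{k} \boldsymbol{\Sigma}_{k}^{-1}$ and mean $\boldsymbol{\Lambda}^{-1} \sum_{k} \beta_{k} \boldsymbol{\Sigma}_{k}^{-1} \boldsymbol{\mu}_{k}$, and $\ln(Z)$ follows by completing the square. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def geometric_mean_gaussian(betas, mus, Sigmas):
    """Parameters of f(x) = (1/Z) prod_k N(x; mu_k, Sigma_k)^{beta_k}.

    Returns (nu, T, logZ), where f(x) = N(x; nu, T) and Z normalises
    the weighted geometric mean of the component densities.
    """
    d = mus[0].shape[0]
    Lambdas = [np.linalg.inv(S) for S in Sigmas]          # component precisions
    Lam = sum(b * L for b, L in zip(betas, Lambdas))      # combined precision
    eta = sum(b * L @ m for b, L, m in zip(betas, Lambdas, mus))
    T = np.linalg.inv(Lam)
    nu = T @ eta
    # log-normaliser, obtained by completing the square in the exponent
    logZ = 0.5 * eta @ T @ eta
    logZ += 0.5 * d * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Lam)[1]
    for b, m, L, S in zip(betas, mus, Lambdas, Sigmas):
        logZ -= 0.5 * b * (m @ L @ m)
        logZ -= 0.5 * b * np.linalg.slogdet(2 * np.pi * S)[1]
    return nu, T, logZ
```

As a sanity check, setting one $\beta_{k} = 1$ recovers that component exactly with $Z = 1$.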
I'll post my attempt to minimise this KLD as an answer. I used what I believe to be an expectation-maximisation-type approach. It is a two-step iteration; for iteration $t+1$:
This process is iterated until convergence. The point of this is to find the Gaussian $q(\mathbf{x})$ that minimises the KLD $D_{\text{KL}} (q (\mathbf{x}) \, || \, p(\mathbf{x}))$, where $p(\mathbf{x}) = \sum_{k=1}^{K} \alpha_{k} p_{k} (\mathbf{x})$ is a Gaussian mixture (GM). This process is known as information projection, and it is meant to be mode seeking. This projection has no closed-form solution for an explicit GM, so I decided to shift the GM into the exponential family by representing it as a joint distribution. The weighted geometric average looks promising, as it should favour the component of the GM with the largest weight and the highest precision. Unfortunately, my approach is strongly dependent on the initial values $\boldsymbol{\nu}^{(0)}$, $\mathbf{T}^{(0)}$ and $\{ \beta_{k}^{(0)} \}_{k=1}^{K}$. Usually, the algorithm simply converges to the Gaussian $p_{k} (\mathbf{x})$ that is closest to the initial conditions.
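One concrete two-step scheme consistent with the decomposition above (a sketch of my own, not necessarily the exact steps used): holding $q$ fixed, minimising $\sum_{k} \beta_{k} [\ln(\beta_{k}/\alpha_{k}) + D_{\text{KL}}(q \, || \, p_{k})]$ subject to $\sum_{k} \beta_{k} = 1$ gives $\beta_{k} \propto \alpha_{k} \exp(-D_{\text{KL}}(q \, || \, p_{k}))$; holding $\{ \beta_{k} \}$ fixed, the $q$-dependent term is minimised by setting $q$ to the precision-weighted geometric mean of the components:

```python
import numpy as np

def kl_gauss(nu, T, mu, Sigma):
    """KL( N(nu, T) || N(mu, Sigma) ) between multivariate Gaussians."""
    d = nu.shape[0]
    Si = np.linalg.inv(Sigma)
    diff = mu - nu
    return 0.5 * (np.trace(Si @ T) + diff @ Si @ diff - d
                  + np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(T)[1])

def fit(alphas, mus, Sigmas, nu0, T0, iters=100):
    """Two-step coordinate descent on the joint KLD (illustrative sketch)."""
    nu, T = nu0, T0
    for _ in range(iters):
        # beta-step: beta_k proportional to alpha_k * exp(-KL(q || p_k))
        w = np.array([a * np.exp(-kl_gauss(nu, T, m, S))
                      for a, m, S in zip(alphas, mus, Sigmas)])
        betas = w / w.sum()
        # q-step: q becomes the weighted geometric mean of the components,
        # a Gaussian with precision sum_k beta_k * Sigma_k^{-1}
        Lams = [np.linalg.inv(S) for S in Sigmas]
        Lam = sum(b * L for b, L in zip(betas, Lams))
        T = np.linalg.inv(Lam)
        nu = T @ sum(b * L @ m for b, L, m in zip(betas, Lams, mus))
    return nu, T, betas
```

With well-separated components this iteration does lock onto whichever component the initial $(\boldsymbol{\nu}^{(0)}, \mathbf{T}^{(0)})$ is closest to, consistent with the initialisation sensitivity described above.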