Minimising the Kullback-Leibler divergence between joint probability distributions


I am trying to solve a variational inference problem, and I need help minimising the Kullback-Leibler divergence (KLD) between two joint distributions. This problem is related to my previous question.

I have a joint distribution \begin{align} p(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \big( \alpha_{k} \, p_{k} (\mathbf{x}) \big)^{z_{k}}, \end{align} where $p_{k}(\mathbf{x}) = \mathcal{N} (\mathbf{x}; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})$ is a Gaussian distribution with mean $\boldsymbol{\mu}_{k}$ and covariance $\boldsymbol{\Sigma}_{k}$, and $\alpha_{k}$ is a weight that must satisfy $0 < \alpha_{k} \leq 1$ together with $\sum_{k=1}^{K} \alpha_{k} = 1$. Here $\mathbf{z} = \begin{bmatrix} z_{1} & z_{2} & \cdots & z_{K} \end{bmatrix}^{T}$ is a $1$-of-$K$ binary variable, where $z_{k} \in \{ 0, 1 \}$ such that $\sum_{k=1}^{K} z_{k} = 1$. That is to say, one element of $\mathbf{z}$ must equal one, while the rest must equal zero. We now introduce a similarly defined joint distribution \begin{align} q(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \big( \beta_{k} \, q(\mathbf{x}) \big)^{z_{k}}, \end{align} where $q(\mathbf{x}) = \mathcal{N} (\mathbf{x}; \boldsymbol{\nu}, \mathbf{T})$, and $\beta_{k} \in (0, 1]$ such that $\sum_{k=1}^{K} \beta_{k} = 1$. I want to find the optimal parameters $\boldsymbol{\nu}$, $\mathbf{T}$ and $\{ \beta_{k} \}_{k=1}^{K}$ that minimise the KLD \begin{align} D_{\text{KL}} (q (\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) ) &= \sum_{\mathbf{z}} \int q(\mathbf{x}, \mathbf{z}) \ln \Big( \frac{ q(\mathbf{x}, \mathbf{z}) }{ p(\mathbf{x}, \mathbf{z}) } \Big) d \mathbf{x}. 
\end{align} The KLD can be simplified as follows: \begin{align} D_{\text{KL}} (q (\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) ) = \int q(\mathbf{x}) \ln \Big( \frac{q(\mathbf{x})}{f(\mathbf{x})} \Big) d \mathbf{x} + \sum_{k=1}^{K} \beta_{k} \ln \Big( \frac{\beta_{k}}{\alpha_{k}} \Big) - \ln(Z), \tag{1} \end{align} where we define the probability distribution \begin{align} f(\mathbf{x}) = \frac{1}{Z} \prod_{k=1}^{K} \big( p_{k} (\mathbf{x}) \big)^{\beta_{k}} \end{align} with the normalising constant \begin{align} Z = \int \prod_{k=1}^{K} \big( p_{k} (\mathbf{x}) \big)^{\beta_{k}} d \mathbf{x}. \end{align} (Note that the weight term enters with a plus sign: expanding the sum over $\mathbf{z}$ gives $\sum_{k} \beta_{k} \ln(\beta_{k}/\alpha_{k})$ directly.) Unfortunately, I don't know how to proceed. How do I find these optimal parameters? (It is tempting to think the KLD is minimised when $q(\mathbf{x}) = f(\mathbf{x})$ and $\beta_{k} = \alpha_{k}$ for all $k$, but I don't think this is true.)
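As a sanity check, the decomposition $D_{\text{KL}}(q(\mathbf{x},\mathbf{z}) \,||\, p(\mathbf{x},\mathbf{z})) = \int q \ln(q/f)\, d\mathbf{x} + \sum_k \beta_k \ln(\beta_k/\alpha_k) - \ln Z$ (weight term with a plus sign) can be verified numerically. Below is a one-dimensional sketch with $K=2$; all distribution parameters are made-up illustrative values, not taken from the problem:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated on the grid x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x = np.linspace(-15.0, 18.0, 40001)
dx = x[1] - x[0]

mu, sigma = np.array([0.0, 3.0]), np.array([1.0, 0.5])  # components p_k
alpha = np.array([0.4, 0.6])                            # mixture weights
beta = np.array([0.3, 0.7])                             # variational weights
log_q = log_gauss(x, 1.0, 1.2)                          # q(x) = N(1, 1.2^2)
q = np.exp(log_q)

# Left-hand side: sum over z (i.e. over k) of the joint-KLD integrand,
# since q(x, z_k = 1) = beta_k q(x) and p(x, z_k = 1) = alpha_k p_k(x).
log_p = np.stack([log_gauss(x, m, s) for m, s in zip(mu, sigma)])
lhs = sum(
    np.sum(beta[k] * q * (np.log(beta[k]) + log_q
                          - np.log(alpha[k]) - log_p[k])) * dx
    for k in range(2)
)

# Right-hand side: KLD from q to the geometric mean f, plus the weight
# term (with a plus sign), minus ln Z.
log_f_unnorm = beta @ log_p                 # sum_k beta_k ln p_k(x)
Z = np.sum(np.exp(log_f_unnorm)) * dx
kld_q_f = np.sum(q * (log_q - (log_f_unnorm - np.log(Z)))) * dx
rhs = kld_q_f + np.sum(beta * np.log(beta / alpha)) - np.log(Z)

print(lhs, rhs)  # the two sides agree to numerical precision
```

The identity holds pointwise in the integrand, so the two sides agree up to quadrature error on the grid.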



I'll post my attempt to minimise this KLD as an answer. I used what I believe to be an expectation-maximisation-type approach: a two-step iteration where, at iteration $t+1$:

  1. I hold the weights $\{\beta_{k}^{(t)}\}_{k=1}^{K}$ fixed, and I want to find the parameters $\boldsymbol{\nu}^{(t+1)}$ and $\mathbf{T}^{(t+1)}$ that minimise the KLD: \begin{align} D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) \big) = \int q^{(t+1)}(\mathbf{x}) \ln \Big( \frac{ q^{(t+1)}(\mathbf{x}) }{ f^{(t)} (\mathbf{x}) } \Big) d \mathbf{x} + \sum_{k=1}^{K} \beta_{k}^{(t)} \ln \Big( \frac{\beta_{k}^{(t)}}{\alpha_{k}} \Big) - \ln(Z^{(t)}). \end{align} Only the first term depends on $q^{(t+1)}(\mathbf{x})$, so this KLD is minimised when $$q^{(t+1)}(\mathbf{x}) = f^{(t)}(\mathbf{x}) = \frac{1}{Z^{(t)}} \prod_{k=1}^{K} \big( p_{k} (\mathbf{x}) \big)^{\beta_{k}^{(t)}},$$ where $Z^{(t)}$ is a normalising constant as before. This choice is feasible because $f^{(t)}(\mathbf{x})$, the weighted geometric average of the set of Gaussians $\{ p_{k} (\mathbf{x}) \}_{k=1}^{K}$, is itself Gaussian, with parameters \begin{align} \mathbf{T}^{(t+1)} &= \Big( \sum_{k=1}^{K} \beta_{k}^{(t)} \boldsymbol{\Sigma}_{k}^{-1} \Big)^{-1}, \\ \boldsymbol{\nu}^{(t+1)} &= \mathbf{T}^{(t+1)} \sum_{k=1}^{K} \beta_{k}^{(t)} \boldsymbol{\Sigma}_{k}^{-1} \boldsymbol{\mu}_{k}. \end{align} (Note the covariance is the inverse of the precision-weighted sum, $\sum_{k} \beta_{k}^{(t)} \boldsymbol{\Sigma}_{k}^{-1}$, consistent with the expression for $\boldsymbol{\nu}^{(t+1)}$.)
  2. I now hold $\boldsymbol{\nu}^{(t+1)}$ and $\mathbf{T}^{(t+1)}$ fixed, and I determine the new weights $\{ \beta_{k}^{(t+1)} \}_{k=1}^{K}$. I rewrite the KLD as \begin{align} D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}, \mathbf{z}) \, || \, p(\mathbf{x}, \mathbf{z}) \big) = \sum_{k=1}^{K} \beta_{k}^{(t+1)} D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}) \, || \, p_{k}(\mathbf{x}) \big) + \sum_{k=1}^{K} \beta_{k}^{(t+1)} \ln \Big( \frac{\beta_{k}^{(t+1)}}{\alpha_{k}} \Big), \tag{2} \end{align} where \begin{align} D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}) \, || \, p_{k}(\mathbf{x}) \big) = \int q^{(t+1)}(\mathbf{x}) \ln \Big( \frac{ q^{(t+1)}(\mathbf{x}) }{ p_{k} (\mathbf{x}) } \Big) d \mathbf{x} \end{align} is the KLD between two Gaussian distributions, and it has a known closed-form solution. I want to minimise Equation 2 subject to the constraint that $\sum_{k=1}^{K} \beta_{k}^{(t+1)} = 1$. Introducing a Lagrange multiplier $\lambda$ for this constraint and setting the derivative of the Lagrangian with respect to $\beta_{k}^{(t+1)}$ to zero gives \begin{align} D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}) \, || \, p_{k}(\mathbf{x}) \big) + \ln \Big( \frac{\beta_{k}^{(t+1)}}{\alpha_{k}} \Big) + 1 + \lambda = 0, \end{align} which, after normalisation, yields the updated weight: \begin{align} \beta_{k}^{(t+1)} = \dfrac{ \alpha_{k} \exp \Big(- D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}) \, || \, p_{k}(\mathbf{x}) \big) \Big) } { \sum_{i=1}^{K} \alpha_{i} \exp \Big(- D_{\text{KL}} \big( q^{(t+1)}(\mathbf{x}) \, || \, p_{i}(\mathbf{x}) \big) \Big) }. \end{align}
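The two steps above can be sketched numerically. The following is a one-dimensional toy implementation (so precisions are scalars), using the precision-weighted form $\mathbf{T} = (\sum_k \beta_k \boldsymbol{\Sigma}_k^{-1})^{-1}$ for step 1 and the closed-form 1-D Gaussian KLD for step 2; the component parameters and starting values are made-up illustrative numbers:

```python
import numpy as np

def kld_gauss(nu, T, mu, S):
    """KL( N(nu, T) || N(mu, S) ) for 1-D Gaussians with variances T and S."""
    return 0.5 * (np.log(S / T) + T / S + (nu - mu)**2 / S - 1.0)

def iterate(mu, Sigma, alpha, beta, n_iter=100):
    for _ in range(n_iter):
        # Step 1: set q to the weighted geometric mean of the p_k:
        # precision T^{-1} = sum_k beta_k / Sigma_k,
        # mean nu = T * sum_k beta_k mu_k / Sigma_k.
        T = 1.0 / np.sum(beta / Sigma)
        nu = T * np.sum(beta * mu / Sigma)
        # Step 2: re-weight: beta_k proportional to alpha_k exp(-KL(q || p_k)),
        # normalised to sum to one.
        d = np.array([kld_gauss(nu, T, mu[k], Sigma[k]) for k in range(len(mu))])
        w = alpha * np.exp(-d)
        beta = w / np.sum(w)
    return nu, T, beta

mu = np.array([0.0, 4.0])      # component means
Sigma = np.array([1.0, 0.25])  # component variances
alpha = np.array([0.5, 0.5])   # mixture weights
nu, T, beta = iterate(mu, Sigma, alpha, beta=np.array([0.5, 0.5]))
print(nu, T, beta)
```

With these values the iteration locks onto the narrow component at $\mu_2 = 4$: the weight $\beta_2$ rapidly approaches one and $q$ converges to (approximately) $\mathcal{N}(4, 0.25)$, illustrating the peak-seeking behaviour discussed below.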

This process is iterated until convergence. The point of this is to find the Gaussian $q(\mathbf{x})$ that minimises the KLD $D_{\text{KL}} (q (\mathbf{x}) \, || \, p(\mathbf{x}))$, where $p(\mathbf{x}) = \sum_{k=1}^{K} \alpha_{k} p_{k} (\mathbf{x})$ is a Gaussian mixture (GM). This process is known as information projection, and it is meant to be peak seeking. This minimisation has no closed form for an explicit GM, so I decided to shift the GM into the exponential family by representing it as a joint distribution. The weighted geometric average looks promising, as it should favour the component in the GM with the largest weight and the highest precision. Unfortunately, my approach is strongly dependent on the initial values for $\boldsymbol{\nu}^{(0)}$, $\mathbf{T}^{(0)}$ and $\{ \beta_{k}^{(0)} \}_{k=1}^{K}$. Usually, the algorithm just converges to the Gaussian $p_{k} (\mathbf{x})$ that is closest to the initial conditions.