When can the collapsed Gibbs sampling be applied?


I understand Gibbs sampling is a means of statistical inference, and it seems that sometimes certain variables can be integrated out during sampling, a technique known as collapsed Gibbs sampling. I really want to know under what circumstances collapsed Gibbs sampling can be applied, and which variables can be integrated out.

I did some searching on Google, and there appear to be no detailed explanations of it. There are some papers on applying collapsed Gibbs sampling to Latent Dirichlet Allocation (LDA), but I am no expert on MCMC and have no idea what LDA is, so those papers may be hard for me to read. Can someone answer this question and, better still, provide some examples? Much appreciated!


BEST ANSWER

I mentioned this paper in a comment, but its explanation of why the collapsed Gibbs sampler is efficient is somewhat technical.

Suppose you had a distribution $\Pi(x_1,...,x_d)$ (using the notation in that paper). A Gibbs sampler finds the full conditional distributions and samples from them in turn, constructing a Markov chain that has $\Pi$ as its stationary distribution. So you'd sample from:

  1. $\pi(x_1|x_{-1})$
  2. $\pi(x_2|x_{-2})$
  3. ...
  4. $\pi(x_d|x_{-d})$

where the notation $x_{-k}$ means all variables except variable $k$, so each step samples one variable conditioned on all the others.
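To make the scheme above concrete, here is a minimal sketch of a plain Gibbs sampler for a standard textbook target (not an example from the paper): a bivariate normal with zero means, unit variances, and correlation $\rho$, whose full conditionals are $x_1|x_2 \sim N(\rho x_2, 1-\rho^2)$ and symmetrically for $x_2$.

```python
import random

random.seed(1)

# Illustrative toy target: bivariate normal, zero means, unit variances,
# correlation rho. The full conditionals are normal, so each Gibbs step
# is a single draw from random.gauss.
rho = 0.9
cond_sd = (1 - rho ** 2) ** 0.5  # sd of x_i given the other coordinate

x1, x2 = 0.0, 0.0
samples = []
for _ in range(20000):
    x1 = random.gauss(rho * x2, cond_sd)  # step 1: draw from pi(x1 | x_{-1})
    x2 = random.gauss(rho * x1, cond_sd)  # step 2: draw from pi(x2 | x_{-2})
    samples.append((x1, x2))
```

After discarding a burn-in, the sample means should be near 0 and the sample correlation near $\rho$, confirming the chain targets the intended distribution.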

Sometimes these variables are sampled grouped together, such as sampling $x_{d-1},x_{d} \mid x_{-(d-1,d)}$ jointly.

If you are able to perform the integration, and you aren't worried about inference on that parameter, you can integrate $x_{d}$ out of that joint conditional and sample $x_{d-1}$ from $\pi(x_{d-1} \mid x_{-(d-1,d)})$ (for this to work, the conditional distributions of the other variables must not depend on $x_{d}$).
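This integrating-out step is exactly what the LDA-style collapsed samplers do, and it is easiest to see in a small conjugate example. The following is an illustrative sketch (my own toy, not from the paper): a two-component mixture of known normals $N(-2,1)$ and $N(2,1)$ whose mixing weight $p$ has a $\mathrm{Beta}(a,b)$ prior. Because the Beta prior is conjugate, $p$ can be integrated out in closed form, leaving a collapsed conditional for each label $z_i$ that depends only on the other labels' counts.

```python
import math
import random

random.seed(0)

a, b = 1.0, 1.0  # Beta prior on the mixing weight p (p is collapsed out)

def norm_pdf(y, mu):
    # density of N(mu, 1) at y
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

# Simulated data: 50 points from each component.
data = ([random.gauss(-2, 1) for _ in range(50)]
        + [random.gauss(2, 1) for _ in range(50)])

z = [random.randint(0, 1) for _ in data]  # random initial labels

def collapsed_sweep(z):
    n = len(z)
    for i in range(n):
        n1 = sum(z) - z[i]  # number of 1-labels among z_{-i}
        # With p integrated out (Beta-Bernoulli conjugacy):
        #   P(z_i = 1 | z_{-i}, y_i) proportional to (a + n1) * N(y_i; 2, 1)
        #   P(z_i = 0 | z_{-i}, y_i) proportional to (b + n - 1 - n1) * N(y_i; -2, 1)
        w1 = (a + n1) * norm_pdf(data[i], 2.0)
        w0 = (b + (n - 1 - n1)) * norm_pdf(data[i], -2.0)
        z[i] = 1 if random.random() < w1 / (w0 + w1) else 0

for _ in range(200):
    collapsed_sweep(z)
```

Note there are no samples of $p$ at all: if you wanted inference on $p$ you could recover it afterwards from the label counts, but the chain itself only ever moves the $z_i$.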

Gibbs samplers produce autocorrelated samples; this is why many people thin the chain to reduce the autocorrelation. The three-schemes theorem presented in that paper shows that, for the following three sampling schemes:

  1. Sampling each variable conditional on all the others.
  2. Sampling every variable except $x_{d-1}$ and $x_d$ one at a time, then sampling $x_{d-1},x_d$ jointly conditional on all the others.
  3. Integrating out $x_d$ and sampling the remaining variables one at a time, each conditional on all the others (with $x_d$ removed).

the autocorrelation under scheme 3 is less than under scheme 2, which in turn is less than under scheme 1.

However, you may not always be able to, strictly speaking, optimize the efficiency of the sampler: the integration may be mathematically intractable, or sampling $x_{d-1},x_d$ jointly may be harder than sampling them individually. Quoting the last paragraph of Section 4 of that paper:

Generally, the Gibbs sampler itself does not specify exactly how the random variable should be partitioned. This is a decision that users have to make, providing an opportunity to their ingenuity. A good Gibbs sampling algorithm must meet two conflicting criteria: (1) drawing one component conditioned on the others must be computationally simple, and (2) the Markov chain induced by the Gibbs sampler with such partitioning components must converge reasonably fast to its equilibrium distribution. For example, drawing the variables jointly with no partitioning at all is optimal for convergence but is formidable; this is the reason why the Gibbs sampler was invented. Theorem 1 provides a theoretical confirmation of such a confliction. It seems to be a reasonable strategy to "group" or "collapse" whenever it is computationally feasible, like those examples in Section 3. But as a whole, it is left to the reader to make compromises to balance all the aforementioned factors.

Edit: I mentioned this in a comment, but if you integrate out the variable $x_d$, you get optimal convergence (among the three sampling schemes) for the remaining variables. You won't have any samples of $x_d$, though, so you should only do this if you don't care about inference on $x_d$.