I understand Gibbs sampling is a means of statistical inference, and it seems that sometimes certain variables can be integrated out in the sampling process, which is known as collapsed Gibbs sampling. I really want to know under what circumstances collapsed Gibbs sampling can be applied, and which variables can be integrated out.
I did some searching on Google, and it appears there is no detailed explanation of it. Though there are some papers on applying collapsed Gibbs sampling to Latent Dirichlet Allocation (LDA), I am no expert on MCMC and have no idea what LDA is, so those papers may be hard for me to read. Can someone answer this question and, even better, provide some examples? Much appreciated!
I mentioned this paper in a comment, but the explanation of why the Collapsed Gibbs sampler is efficient was kind of technical.
Suppose you had a distribution $\Pi(x_1,\dots,x_d)$ (using the notation in that paper). A Gibbs sampler finds the full conditional distributions and samples from them to construct a Markov chain with $\Pi$ as its stationary distribution. So you'd sample from:

$$x_k \sim \Pi(x_k \mid x_{-k}), \qquad k = 1, \dots, d,$$

where the notation $x_{-k}$ means all variables except variable $k$, so each step samples one variable conditioned on all the other variables.
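For concreteness, here is a minimal sketch of such a plain Gibbs sampler on a toy target of my own choosing (not from the paper): a bivariate standard normal with correlation $\rho$, where both full conditionals are normal and known in closed form.

```python
import random
import math

# Toy target (illustration only): (x1, x2) bivariate standard normal
# with correlation rho. The full conditionals are
#   x1 | x2 ~ N(rho * x2, 1 - rho^2)   and symmetrically for x2 | x1.
def gibbs_bivariate_normal(n_samples, rho=0.9, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(rho * x2, sd)  # sample from x1 | x2
        x2 = rng.gauss(rho * x1, sd)  # sample from x2 | x1
        samples.append((x1, x2))
    return samples

draws = gibbs_bivariate_normal(5000)
```

With enough draws, the sample means are near 0 and the sample correlation is near $\rho$, as the stationary distribution requires.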
Sometimes these variables are sampled together as a group, such as sampling $x_{d-1},x_{d} \mid x_{(-d-1,-d)}$.
One thing you could do, if you had the ability to perform the integration and you weren't interested in inference on the parameter, is integrate $x_{d}$ out of that conditional distribution and sample from the conditional distribution $x_{d-1} \mid x_{(-d-1,-d)}$ instead (you would need to know that the conditional distributions of the other variables don't depend on $x_{d}$).
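As a hedged toy illustration of what "integrating out" buys you (my example, not the paper's): if $(x_1, x_2)$ is bivariate standard normal with correlation $\rho$ and you only care about $x_1$, integrating $x_2$ out of the joint leaves the marginal $x_1 \sim N(0, 1)$, which can be sampled directly with no Markov chain at all.

```python
import random

# Collapsed version of a toy bivariate-normal target (illustration only):
# integrating x2 out of a bivariate standard normal leaves the marginal
# x1 ~ N(0, 1), so x1 is drawn directly. Note that no samples of x2 are
# produced -- that is the price of collapsing.
def collapsed_sampler(n_samples, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

marginal_draws = collapsed_sampler(5000)
```

In realistic models (e.g. conjugate hierarchical models) the integration yields a nontrivial collapsed conditional rather than a simple marginal, but the principle is the same.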
Gibbs samplers produce autocorrelated samples, which is why many people thin their chains to reduce the autocorrelation. The three-schemes theorem presented in that paper compares the following three approaches to sampling:

1. standard Gibbs sampling, where $x_{d-1}$ and $x_{d}$ are each sampled from their own conditional distribution;
2. grouped Gibbs sampling, where $x_{d-1},x_{d}$ are sampled jointly from $x_{d-1},x_{d} \mid x_{(-d-1,-d)}$;
3. collapsed Gibbs sampling, where $x_{d}$ is integrated out and $x_{d-1}$ is sampled from $x_{d-1} \mid x_{(-d-1,-d)}$.

It shows that the autocorrelation of scheme 3 is less than the autocorrelation of scheme 2, which is less than the autocorrelation of scheme 1.
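You can see the autocorrelation gap empirically on a toy target (my construction, not the paper's): for a bivariate standard normal with correlation $\rho$, the $x_1$ chain of a plain Gibbs sampler is an AR(1) process with lag-1 autocorrelation $\rho^2$, while the collapsed sampler, having integrated $x_2$ out, draws $x_1$ i.i.d.

```python
import random
import math

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

rho = 0.9
sd = math.sqrt(1.0 - rho * rho)
rng = random.Random(1)

# Scheme 1: plain Gibbs on the bivariate normal; record the x1 chain.
x1 = x2 = 0.0
gibbs_x1 = []
for _ in range(20000):
    x1 = rng.gauss(rho * x2, sd)
    x2 = rng.gauss(rho * x1, sd)
    gibbs_x1.append(x1)

# Scheme 3: x2 integrated out, so x1 ~ N(0, 1) i.i.d.
collapsed_x1 = [rng.gauss(0.0, 1.0) for _ in range(20000)]

print(lag1_autocorr(gibbs_x1))      # close to rho**2 = 0.81
print(lag1_autocorr(collapsed_x1))  # close to 0
```

Lower autocorrelation means more effective samples per iteration, which is the sense in which the collapsed sampler is more efficient.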
However, sometimes you might not be able, strictly speaking, to optimize the efficiency of the sampler this way, either because of mathematical intractability or because actually sampling $x_{d-1},x_{d}$ jointly is more difficult than sampling them individually. Quoting the last paragraph of section 4 of that paper:
Edit: I mentioned this in a comment, but if you integrate out the variable $x_d$ you get the best convergence of the three sampling schemes in the Gibbs sampler for the other variables. You won't have any samples of $x_d$, though, so you should only do this if you don't care about inference on $x_d$.