Difference between variational inference and the Gibbs sampler


I am learning about variational inference and the Gibbs sampler, and I am in the process of deriving variational inference on my own. As part of this, I need to make a comparison with the Gibbs sampler. I am reading the book "Machine Learning: A Probabilistic Perspective" by Kevin Murphy, and I found this sentence that I need help with:

"Since we are replacing the neighboring values by their mean value, the method is known as mean field. This is very similar to Gibbs sampling (Section 24.2), except instead of sending sampled values between neighboring nodes, we send mean values between nodes."

Could someone please explain how Gibbs sampling sends sampled values between neighbors while VI sends mean values between nodes? I am looking for an in-depth explanation that spells out what corresponds to what and what is meant by this comparison.


There are 2 answers below.

Best answer

"Since we are replacing the neighboring values by their mean value, the method is known as mean field. This is very similar to Gibbs sampling (Section 24.2), except instead of sending sampled values between neighboring nodes, we send mean values between nodes."

Let's call $q(\textbf x)$ the target distribution that we're trying to sample from, where $\textbf x$ (often denoted $\theta$) is the parameter vector $(x_1,\dots,x_d)$.

Gibbs sampling works by drawing each coordinate $x_i$ from its conditional distribution given the most recent values of all the other coordinates, hence "sending sampled values across neighboring nodes." That is, Gibbs sampling proceeds as follows:

1. Initialize $\textbf x^{(0)}=(x_1^{(0)}, \dots, x_d^{(0)})$.
2. On each iteration $j$, generate each coordinate using the most recent values: \begin{align}x_1^{(j)}&\sim q_1(x_1|x_2^{(j-1)},\dots,x_d^{(j-1)})\\ x_2^{(j)}&\sim q_2(x_2|x_1^{(j)}, x_3^{(j-1)},\dots,x_d^{(j-1)})\\ &\vdots\\ x_d^{(j)}&\sim q_d(x_d|x_1^{(j)},\dots,x_{d-1}^{(j)})\end{align}
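As a minimal sketch of these coordinate-wise updates, consider a toy target (my own example, not from the book): a bivariate Gaussian with unit marginal variances and correlation $\rho$, whose conditionals $x_1\mid x_2$ and $x_2\mid x_1$ are Gaussian with mean $\mu_i+\rho(x_{-i}-\mu_{-i})$ and variance $1-\rho^2$:

```python
import numpy as np

# Toy Gibbs sampler for a bivariate Gaussian with correlation rho.
rng = np.random.default_rng(0)
mu1, mu2, rho = 1.0, -1.0, 0.8
sd = np.sqrt(1 - rho**2)   # conditional standard deviation

x1, x2 = 0.0, 0.0          # initialize x^(0)
samples = []
for j in range(10_000):
    # x1^(j) ~ q1(x1 | x2^(j-1)) = N(mu1 + rho*(x2 - mu2), 1 - rho^2)
    x1 = rng.normal(mu1 + rho * (x2 - mu2), sd)
    # x2^(j) ~ q2(x2 | x1^(j))   = N(mu2 + rho*(x1 - mu1), 1 - rho^2)
    x2 = rng.normal(mu2 + rho * (x1 - mu1), sd)
    samples.append((x1, x2))

samples = np.array(samples[1000:])   # drop burn-in
print(samples.mean(axis=0))          # close to (mu1, mu2) = (1, -1)
```

Each draw conditions on the value just sampled for the other coordinate; this is the "sending sampled values between neighboring nodes."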

The mean field method first assumes that the posterior distribution factorizes over some partition of the parameters into $D$ groups:

$$q(\textbf x) = \prod _{i=1}^D q_i(\textbf x_i)$$ where $D\le d$.

The update rule for each iteration is:

$$\log q_j(\textbf x_j) = \mathbb E_{-q_j}[\log \tilde p(\textbf x)]+\text{const}$$

where $\tilde p$ is the unnormalized joint and, for example, with three parameters, $\mathbb E_{-q_2}[f(\textbf x)] = \sum_{x_1}\sum_{x_3}q_1(x_1)q_3(x_3)f(x_1,x_2,x_3)$.

One full iteration of coordinate updates can be written as:

\begin{align} \log q_1(\textbf x_1) &= \mathbb E_{-q_1}[\log \tilde p(\textbf x)]+\text{const}\\ \log q_2(\textbf x_2) &= \mathbb E_{-q_2}[\log \tilde p(\textbf x)]+\text{const}\\ \vdots\\ \log q_D(\textbf x_D) &= \mathbb E_{-q_D}[\log \tilde p(\textbf x)]+\text{const} \end{align}

Since the update for $q_2$ uses the expected value (mean) under the freshly updated $q_1$, and so on down the chain, this is like passing the mean from step to step.
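For a toy bivariate Gaussian target with unit marginal variances and correlation $\rho$ (my own example, not from the book), these coordinate-ascent updates have a closed form: each factor $q_i$ is Gaussian with variance $1-\rho^2$, and only its mean $m_i$ changes, using the current mean of the other factor:

```python
# Coordinate-ascent mean-field updates for a bivariate Gaussian with
# means (mu1, mu2), unit marginal variances, and correlation rho.
mu1, mu2, rho = 1.0, -1.0, 0.8

m1, m2 = 0.0, 0.0   # initial means of q1, q2
for s in range(50):
    # log q1(x1) = E_{q2}[log p~(x)] + const  =>  m1 = mu1 + rho*(m2 - mu2)
    m1 = mu1 + rho * (m2 - mu2)
    # log q2(x2) = E_{q1}[log p~(x)] + const  =>  m2 = mu2 + rho*(m1 - mu1)
    m2 = mu2 + rho * (m1 - mu1)

print(m1, m2)   # converges to (mu1, mu2) = (1.0, -1.0)
```

Note that each update has exactly the same form as the corresponding Gibbs conditional mean, but it propagates the deterministic mean $m_i$ rather than a sampled value $x_i$.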

Another answer

For every node $x_n$ in a graphical model, we can use Gibbs sampling to sample from its full conditional in an iterative fashion: $$ x^s_{n}\sim P(x_n|x_{mb}^{s-1}) $$ where $x_{mb}$ denotes the Markov blanket of $x_n$. Every neighboring node $x_{i}$ is then updated in the same fashion, now conditioning on the fresh value $x_n^s$: $$ x_i^s\sim P(x_i|x_n^s, x_{mb}^{s-1}) $$ We continue until all nodes have been updated in sweep $s$. This is what the book means by sending sampled values between neighboring nodes: each node is updated by conditioning on the sampled values of its neighbors.


Mean field VI, on the other hand, finds the distribution $q$ that minimizes the KL divergence between it and the target distribution. The mean field assumption is that $$ q_\psi(x_1,\dots, x_n)=q(x_1)\cdots q(x_n) $$ The derivation shows that we can obtain each $q(x_i)$ by updating the factors iteratively for $j=1:n$ and $s = 1:T$:

$$ g_s(x_j)=\mathbb E_{q(x_{mb_j})}[\log P(x_j,x_{mb_j})],\qquad q_s(x_j)\propto \exp(g_s(x_j)) $$ As shown, to update the variational factor for a node we take the mean of the complete log joint with respect to the variational distribution over that node's Markov blanket.


Gibbs sampling is a special case of the Metropolis-Hastings algorithm (an MCMC method) with a 100% acceptance rate, while mean field is a VI method that optimizes a surrogate distribution instead of sampling directly from the target distribution.
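The contrast can be seen in a single sweep on a toy bivariate Gaussian with unit marginal variances and correlation $\rho$ (my own illustration, not from either answer): Gibbs propagates a random sample, while the mean-field update propagates a deterministic mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, rho = 1.0, -1.0, 0.8
sd = np.sqrt(1 - rho**2)

# Gibbs: given the current state x2, draw a *sample* of x1 | x2.
x2 = 0.0
x1_sample = rng.normal(mu1 + rho * (x2 - mu2), sd)

# Mean-field CAVI: given the current mean m2 of q2, deterministically
# update the *mean* m1 of q1.
m2 = 0.0
m1 = mu1 + rho * (m2 - mu2)

print(x1_sample)  # random; depends on the seed
print(m1)         # always 1.8 for these settings
```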

Further details can be found in this book.