What is the exact nature of messages in belief propagation on a factor graph?


The recursive definition of messages in belief propagation on a factor graph makes it quite confusing to me what the messages exactly correspond to: http://en.wikipedia.org/wiki/Belief_propagation

The message from variable to factor, $\mu_{v \rightarrow u}(x_v)$, has $x_v$ as argument, so it will necessarily be - in the end - a random variable. The same holds for $\mu_{u \rightarrow v}(x_v)$, the message from factor to variable. In other words, the memory space required for storing these messages should equal the cardinality of the finite random variable $x_v$.

In "Pattern Recognition and Machine Learning" (Bishop, 2006), Eq. 8.74 states:

$\mu_{x_1 \rightarrow f_a}(x_1) = 1$

How is this possible? Shouldn't that be:

$\mu_{x_1 \rightarrow f_a}(x_1) = x_1$

The result has to be a random variable, and while $1$ as a constant is a degenerate random variable, this doesn't seem consistent with the next equation, Eq. 8.78:

$\mu_{x_2 \rightarrow f_b}(x_2) = \mu_{f_a \rightarrow x_2}(x_2) \mu_{f_c \rightarrow x_2}(x_2)$

The message is composed of a product of the messages from the factor nodes; for example, the one from factor $f_a$ is:

$\mu_{f_a \rightarrow x_2}(x_2)=\sum_{x_1}f_a(x_1,x_2)$

So, $\mu_{f_a \rightarrow x_2}(x_2)$ is definitely a random variable with the same cardinality as $x_2$. And a product of two random variables should again be a random variable, because a random variable is, after all, a function. Hence $\mu_{x_2 \rightarrow f_b}(x_2)$ is again a random variable with the same cardinality as $x_2$. And if that is the case for this $\mu_{x_i \rightarrow f_j}(x_i)$, I don't understand why it would be different for $\mu_{x_1 \rightarrow f_a}(x_1)$. However, the explanations of belief propagation that I have encountered never go into detail on the spaces in which the variables, functions, or messages operate.


Best answer:

It might be useful to distinguish between variable names "$v$" and the value "$x$" a variable can take.

Each variable $v$ has a domain $D_v$. Usually this is finite, and often it is $\{0,1\}$. A message to or from $v$ is a function $D_v\to\mathbb R_{\geq 0}$. This is like a probability distribution, but doesn't have to be normalised (depending on convention).

We start the sum-product algorithm by sending out an uninformative prior: the constant function $\mu_{v\to f}\colon D_v\to\mathbb R_{\geq 0}$ defined by $\mu_{v\to f}(x)=1$. It would be a type error to set $\mu_{v\to f}(x)=v$ or $\mu_{v\to f}(x)=x$: messages are functions taking real values, but $v$ is a variable name, and $x$ is an element of a domain $D_v$ (which I think Bishop allows to be an arbitrary finite set).

There are no random variables anywhere in the sum-product algorithm; it is deterministic. However, the marginals it computes can be viewed as probability distributions.

Another answer:

The code I used myself in the end:

/**
 * Forward declaration of the message type: an (unnormalized) probability mass
 * function over outcomes of type T, with probabilities of type P.
 */
template <typename P, typename T>
class probability;

/**
 * The implementation of the vertex class. In the case of an Ising variable,
 * typename "T" would be a boolean. Note that for an image segmentation task
 * you will need to map your samples (for example, a grayscale image) to real
 * numbers (probabilities). Small bins will result in huge conditional
 * probability tables on the factors.
 *
 * The following template parameters can be used:
 * @param T     type of an event, e.g. for a ternary event you will need at
 *              least an (u)int8_t type, default: uint8_t
 * @param S     type of the state on the vertex; for variable nodes this is an
 *              event (probability), for factor nodes this is a conditional
 *              probability table
 * @param P     type to indicate probabilities, should be floating point, e.g.
 *              float or double, default: float
 * @param M     type of the messages that are communicated between the nodes on
 *              the graph, default: probability<P,T>
 * @param N     type of the index to the nodes, e.g. uint8_t would limit the
 *              graph to 2^8-1 nodes, default: size_t
 */
template <typename T, typename S, typename P = float,
          typename M = probability<P,T>, typename N = size_t>
class vertex {
   // content
};

As you can see, the message is of type probability&lt;P,T&gt;, which is like a probability mass function, but it only stores the probabilities of the outcomes that are nonzero, and it does not enforce that the values sum to one, so that unnormalized messages can be communicated.