The recursive definition of messages in belief propagation on a factor graph makes it confusing to me what exactly the messages correspond to: http://en.wikipedia.org/wiki/Belief_propagation
The message from variable to factor, $\mu_{v \rightarrow u}(x_v)$, has $x_v$ as its argument, so in the end it will necessarily be a random variable. The same holds for $\mu_{u \rightarrow v}(x_v)$, the message from factor to variable. In other words, the memory required for storing these messages should equal the cardinality of the finite random variable $x_v$.
In "Pattern Recognition and Machine Learning" (Bishop, 2006), he states in Eq. 8.74:
$\mu_{x_1 \rightarrow f_a}(x_1) = 1$
How is this possible? Shouldn't that be:
$\mu_{x_1 \rightarrow f_a}(x_1) = x_1$
The result has to be a random variable, and while the constant $1$ is a degenerate random variable, it doesn't seem to fit with the next equation, 8.78:
$\mu_{x_2 \rightarrow f_b}(x_2) = \mu_{f_a \rightarrow x_2}(x_2) \mu_{f_c \rightarrow x_2}(x_2)$
The message is composed of a product of the messages from the factor nodes; for example, the one from factor $f_a$ is:
$\mu_{f_a \rightarrow x_2}(x_2)=\sum_{x_1}f_a(x_1,x_2)$
So, $\mu_{f_a \rightarrow x_2}(x_2)$ is definitely a random variable with the same cardinality as $x_2$. And a product of two random variables should again be a random variable, because a random variable is a function after all. Hence $\mu_{x_2 \rightarrow f_b}(x_2)$ is again a random variable with the same cardinality as $x_2$. And if that is the case for this $\mu_{x_i \rightarrow f_j}(x_i)$, I don't understand why it would be different for $\mu_{x_1 \rightarrow f_a}(x_1)$. However, the explanations of belief propagation that I have encountered never go into detail on the spaces in which the variables, functions, and messages live.
It might be useful to distinguish between variable names "$v$" and the values "$x$" a variable can take.
Each variable $v$ has a domain $D_v$. Usually this is finite, and often it is $\{0,1\}$. A message to or from $v$ is a function $D_v\to\mathbb R_{\geq 0}$. This is like a probability distribution, but doesn't have to be normalised (depending on convention).
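To make the type concrete, a message can be sketched as a finite table of nonnegative reals, one entry per domain element (the dict representation and the names `D_v`, `msg`, `normalise` are my own illustration, not notation from any particular text):

```python
# A message to or from a variable v is a function D_v -> R>=0.
# For a finite domain this is just a table of nonnegative numbers,
# one entry per element of D_v -- so its storage is |D_v|.
D_v = [0, 1]  # the usual binary domain {0, 1}

# An (unnormalised) message: an arbitrary nonnegative value per domain element.
msg = {0: 0.3, 1: 1.7}

def normalise(message):
    """Rescale a message so its values sum to 1, making it a distribution."""
    total = sum(message.values())
    return {x: p / total for x, p in message.items()}

print(normalise(msg))  # {0: 0.15, 1: 0.85}
```

Whether messages are normalised at each step or only at the end is exactly the convention choice mentioned above; the fixed point is the same up to scaling.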
We start the sum-product algorithm by sending out an uninformative prior: the constant function $\mu_{v\to f}\colon D_v\to\mathbb R_{\geq 0}$ defined by $\mu_{v\to f}(x)=1$. It would be a type error to set $\mu_{v\to f}(x)=v$ or $\mu_{v\to f}(x)=x$: messages are functions taking real values, but $v$ is a variable name, and $x$ is an element of a domain $D_v$ (which I think Bishop allows to be an arbitrary finite set).
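A minimal sketch of the two message types, assuming the dict-over-domain representation (the function names and the example factor values are mine):

```python
def leaf_variable_message(domain):
    """Initial message from a leaf variable node: the constant function 1
    (Bishop Eq. 8.74). Note the *values* are 1 for every x in the domain;
    the message is not 'x' itself -- that would be a type error."""
    return {x: 1.0 for x in domain}

def factor_to_variable_message(factor, dom_x1, dom_x2, incoming):
    """Message from factor f_a to variable x_2:
    mu_{f_a -> x_2}(x_2) = sum_{x_1} f_a(x_1, x_2) * mu_{x_1 -> f_a}(x_1)."""
    return {x2: sum(factor(x1, x2) * incoming[x1] for x1 in dom_x1)
            for x2 in dom_x2}

# With the uninformative leaf message, the factor message reduces to the
# plain marginalisation sum_{x_1} f_a(x_1, x_2) quoted in the question.
f_a = lambda x1, x2: [[0.75, 0.25], [0.5, 0.5]][x1][x2]
mu = factor_to_variable_message(f_a, [0, 1], [0, 1],
                                leaf_variable_message([0, 1]))
print(mu)  # {0: 1.25, 1: 0.75}
```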
There are no random variables anywhere in the sum-product algorithm - it is deterministic. However, the marginals it computes can be viewed as a probability distribution.
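As a sanity check, here is a sketch (my own, not from Bishop) of sum-product on the tiny chain $x_1$ -- $f_a$ -- $x_2$, comparing the computed marginal against direct summation; every message is an ordinary deterministic table of numbers:

```python
# A single factor f_a(x1, x2) over binary variables, stored as a table.
D = [0, 1]
f_a = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.9}

# Leaf message (the constant function 1), then the factor-to-variable message.
mu_x1_fa = {x1: 1.0 for x1 in D}
mu_fa_x2 = {x2: sum(f_a[x1, x2] * mu_x1_fa[x1] for x1 in D) for x2 in D}

# The (unnormalised) marginal of x2 is the product of all incoming messages;
# here there is just one. Compare with brute-force marginalisation.
brute = {x2: sum(f_a[x1, x2] for x1 in D) for x2 in D}
assert all(abs(mu_fa_x2[x2] - brute[x2]) < 1e-12 for x2 in D)

# Normalising yields a probability distribution, even though no step of
# the algorithm itself involved any randomness.
Z = sum(mu_fa_x2.values())
print({x2: p / Z for x2, p in mu_fa_x2.items()})  # {0: 0.35, 1: 0.65}
```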