"Marginalizing out" a parameter in a PDF


Say I have a variable $x$ whose PDF is a von Mises distribution with a single parameter, $\kappa$, that determines the distribution's concentration: $VM(x; \kappa)$.

But suppose $\kappa$ itself is not fixed, but drawn from a Gamma distribution: $G(\kappa; \bar{\kappa}, \tau)$, where the mean is $\bar{\kappa}$ and the variance is $\bar{\kappa}\tau$.

An author whose work I'm replicating has this to say about this situation: "The predictions [for each $x$ value's probability] have to be averaged over all possible values of precision $\kappa$ (mathematically: we need to 'marginalize out $\kappa$'). We do this by discretizing the (gamma) distribution over $\kappa$ into 50 bins with equal probability masses, computing the model [von Mises] prediction at each bin center, and then averaging the predictions. This solution is accurate as long as the number of bins is not too small (in the limit of an infinite amount of bins, the solution is equal to an analytical marginalization)" (emphasis mine).

I've read elsewhere that probabilities should be summed, not averaged, in marginalization. Another confusion I have regards this author's statement that the average should be "over all possible values of $\kappa$." This seems to imply that each possible value of $\kappa$ should be treated equally -- but then we're talking about a uniform distribution, not a Gamma one. Clearly, the author was using language loosely.

Can someone shed some light on this, maybe with language that is more precise?


There are 3 best solutions below

BEST ANSWER

This is basically a hierarchical model. Let $X | \theta$ be some random variable that follows some parametric distribution with parameter $\theta$, where $\theta$ itself follows some other parametric distribution with hyperparameter $\lambda$. (Note that these quantities may be vector-valued to handle the case where the distributions described have more than one parameter/hyperparameter).

Then the marginal distribution of $X$ is given by $$f_X(x) = \int_{\theta \in \Omega} f_{X \mid \theta}(x | \theta) f_{\theta}(\theta) \, d\theta,$$ where $\Omega$ is the support of the distribution of $\theta$. This is in some sense a weighted average of the conditional density of $X$ given $\theta$, weighted by the density of $\theta$ over its support.
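This integral can be evaluated directly by numerical quadrature. A minimal sketch for the question's setup (von Mises conditional, gamma mixing distribution), using illustrative gamma hyperparameters that are not taken from the question:

```python
# Marginal density of X via numerical integration, assuming
#   X | kappa ~ von Mises(kappa)   and   kappa ~ Gamma(shape, scale).
# The hyperparameter values below are illustrative only.
import numpy as np
from scipy import stats
from scipy.integrate import quad

shape, scale = 4.0, 2.0  # gamma hyperparameters (illustrative)

def marginal_pdf(x):
    """f_X(x) = integral over (0, inf) of f(x | kappa) * f(kappa) d kappa."""
    integrand = lambda k: (stats.vonmises.pdf(x, kappa=k)
                           * stats.gamma.pdf(k, a=shape, scale=scale))
    val, _ = quad(integrand, 0, np.inf)
    return val

# Sanity check: the marginal is a proper density on (-pi, pi], so it
# should integrate to 1.
total, _ = quad(marginal_pdf, -np.pi, np.pi)
print(round(total, 4))
```

The inner `quad` call is the "analytical marginalization" in the limit the quoted author refers to; the discretization methods below trade its accuracy for speed.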

There are several ways to perform a numerical approximation by discretization of this integral.

Method 1

One way is to sample $\Omega$ on a regular grid, calculate the density at each point, and use these as the weights in a discrete sum $$f_X(x) \approx \sum_{k = 1}^N f_{X \mid \theta}(x | \theta_k) f_\theta(\theta_k) \Delta_k,$$ where $\Delta_k = \theta_{k+1} - \theta_k$ is the interval between the regularly spaced samples $\{\theta_k\}$. (In practice the grid must be truncated at some finite $N$, ignoring the negligible probability mass in the tail.)
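A sketch of this uniform-grid Riemann sum, with an illustrative truncation point `60.0` chosen so the ignored gamma tail mass is negligible (not a value from the answer):

```python
# Method 1 sketch: regular grid over a truncated support, with each
# conditional density weighted by the gamma pdf times the grid spacing.
# Hyperparameters, grid size, and truncation point are illustrative.
import numpy as np
from scipy import stats

shape, scale = 4.0, 2.0                   # gamma hyperparameters (illustrative)
kappas = np.linspace(1e-3, 60.0, 5000)    # regular spacing; tail beyond 60 ignored
dk = kappas[1] - kappas[0]
weights = stats.gamma.pdf(kappas, a=shape, scale=scale) * dk  # f(theta_k) * Delta_k

def marginal_pdf_grid(x):
    return float(np.sum(stats.vonmises.pdf(x, kappa=kappas) * weights))

# Sanity check: the discretized marginal should still integrate to ~1.
xs = np.linspace(-np.pi, np.pi, 2001)
vals = np.array([marginal_pdf_grid(x) for x in xs])
print(round(float(np.sum(vals) * (xs[1] - xs[0])), 3))
```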

Method 2

Another way would be to choose a variable spacing; e.g., $$\theta_k = \psi(k),$$ for some function $\psi$ that is related to the density in a way that $$\Pr\left[\frac{\psi(k-1) + \psi(k)}{2} < \theta \le \frac{\psi(k) + \psi(k+1)}{2}\right] = \text{constant}, \quad k = 1, 2, \ldots.$$ If you do this, then the weighting by the density is no longer needed, and you only need to average by the number of samples.

Method 3

Finally, the method that is proposed in the question itself is to set some positive integer $N$, and calculate a sequence $\{\varphi(k)\}_{k=0}^N$ with $\varphi(0) = -\infty$ and $\varphi(N) = \infty$ (or, more generally, the endpoints of $\Omega$; for a gamma distribution, $\varphi(0) = 0$) such that $$\Pr[\varphi(k-1) < \theta \le \varphi(k)] = 1/N$$ for all $k = 1, 2, \ldots, N$. Then define $\theta_k = (\varphi(k-1) + \varphi(k))/2$ and use these as the points to be sampled (this, as well as the previous method described, is problematic at infinity, but can be addressed somewhat by using large $N$ and ignoring the tails).

For this last method, as for the second one, weighting by the density is not needed: because every bin carries the same probability mass $1/N$, a plain (unweighted) average over the sampled points suffices.
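A sketch of the equal-mass-bin method from the question (Method 3). To sidestep the unbounded outer bins mentioned above, this variant places each sample at the quantile in the middle of its bin's probability mass, `ppf((k - 0.5)/N)`, rather than at the midpoint of the bin edges; the hyperparameters are again illustrative:

```python
# Method 3 sketch: N = 50 equal-probability-mass bins of the gamma
# distribution over kappa.  Each bin carries mass 1/N, so the marginal
# is a plain average of the conditional von Mises densities.
import numpy as np
from scipy import stats

shape, scale = 4.0, 2.0  # gamma hyperparameters (illustrative)
N = 50
# Quantile at the centre of each bin's probability mass (avoids the
# infinite edge of the last bin).
centers = stats.gamma.ppf((np.arange(N) + 0.5) / N, a=shape, scale=scale)

def marginal_pdf_bins(x):
    # Equal-mass bins => equal weights 1/N => unweighted mean.
    return float(np.mean(stats.vonmises.pdf(x, kappa=centers)))

# Sanity check: the averaged density should still integrate to ~1.
xs = np.linspace(-np.pi, np.pi, 2001)
vals = np.array([marginal_pdf_bins(x) for x in xs])
print(round(float(np.sum(vals) * (xs[1] - xs[0])), 3))
```

This is why the quoted author can speak of "averaging the predictions": the gamma weighting has been absorbed into the placement of the bins, so each bin contributes equally.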

ANSWER

Suppose you have: \begin{align} \Pr(X = 0 \mid k=0) & = 2/3 & & & \Pr(X=1\mid k=0) & = 1/3 \\[10pt] \Pr(X = 0 \mid k=1) & = 1/2 & & & \Pr(X=1\mid k=1) & = 1/2 \end{align} If you want the marginal distribution of $X$, you average the two rows above, with weight equal to $\Pr(k=0)$ for the first row and to $\Pr(k=1)$ for the second row. Let's suppose those weights are each $1/2$. Then the joint distribution is: \begin{align} \Pr(X=0\ \&\ k=0) & = 1/3 & & & \Pr(X=1\ \&\ k=0) & = 1/6 \\[10pt] \Pr(X=0\ \&\ k=1) & = 1/4 & & & \Pr(X=1\ \&\ k=1) & = 1/4 \end{align} Then rather than averaging, you sum, to get \begin{align} \Pr(X=0) & = \frac 7 {12}, & & & \Pr(X=1) & = \frac 5 {12}. \end{align}

In one case you have a conditional distribution given $k$, and you find an average; in the other case you have a joint distribution, and you find the sum. Either way you get the same marginal distribution.
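The two-row example above can be checked numerically; both routes (weighted average of conditionals, sum over the joint) produce the same marginal:

```python
# Verify that averaging conditional rows with weights Pr(k) equals
# summing the joint distribution over k.
from fractions import Fraction as F

cond = {0: {0: F(2, 3), 1: F(1, 3)},   # Pr(X = x | k = 0)
        1: {0: F(1, 2), 1: F(1, 2)}}   # Pr(X = x | k = 1)
prior = {0: F(1, 2), 1: F(1, 2)}       # Pr(k)

# Route 1: weighted average of the conditional rows.
marg_avg = {x: sum(prior[k] * cond[k][x] for k in prior) for x in (0, 1)}

# Route 2: build the joint Pr(X = x & k) = Pr(k) * Pr(X = x | k), then sum over k.
joint = {(x, k): prior[k] * cond[k][x] for k in prior for x in (0, 1)}
marg_sum = {x: sum(joint[(x, k)] for k in prior) for x in (0, 1)}

print(marg_avg[0], marg_avg[1])  # 7/12 5/12
assert marg_avg == marg_sum
```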

In the situation you describe, the von Mises distribution is a conditional distribution given the value of $\kappa$.

The average should be a weighted average with weights given by the distribution of $\kappa$.

ANSWER

You have a conditional density: $f_{X\mid K}(x\mid \kappa) = VM(x;\kappa)$

By "analytic marginalisation" the author means:

$$\begin{align}f_X(x) ~=~ & \mathsf E\big(f_{X\mid K}(x\mid K)\big) \\[1ex] ~=~ & \int_0^\infty f_{X\mid K}(x\mid \kappa)~f_K(\kappa) ~\operatorname d \kappa\end{align}$$

However, the author claims that, as long as the number of bins is not too small, this integral can be approximated by "averaging out": discretizing the gamma distribution into equal-probability-mass bins and taking a plain average of the conditional densities at the bin centers.