Why can we marginalize out a likelihood function by expectation/integration so that it depends on the hyper-parameters?


Suppose that $P \sim f(p\mid \alpha, \beta)$ and that $Q\sim f(q\mid \gamma, \delta)$. Now suppose that the likelihood function of $p,q$ given the data $x$ is:

$$ L(p,q\mid x) $$

I am wondering why the likelihood of the hyper-parameters $\alpha, \beta, \gamma,\delta$ can be written as:

$$ L(\alpha, \beta, \gamma,\delta\mid x) = \iint L(p,q\mid x)f(p\mid\alpha, \beta)f(q\mid\gamma, \delta)dpdq $$

I saw this statement in a paper, where the authors described it as taking an expectation. It seems to me that it is just marginalizing out $p$ and $q$. However, what is the explicit form of $L(p,q\mid x)$ here?

Is it true that:

$$ L(p,q\mid x) = p(x\mid p,q) = \frac{p(x, p, q)}{f(p)f(q)} $$

and hence we have:

$$ \iint L(p,q\mid x)f(p\mid\alpha, \beta)f(q\mid \gamma, \delta) \, dp \, dq = \iint p(x, p, q)\,dp\,dq \text{ ?} $$

This doesn't make sense to me, as I thought that $L(p,q\mid x)$ should be viewed as a function of $p,q$ for fixed $x$. Additionally, why does $f(p)$ no longer carry the conditioning on the hyper-parameters that appears in $f(p\mid \alpha, \beta)$?


There is 1 best solution below


Basically we can construct a Bayesian network (a directed acyclic graph, DAG) of the hyperparameters' influence on the parameters and their influence on the random variable:

$$\begin{array}{ccccccccc} \alpha & & \beta & & & & \gamma & & \delta\\ & \searrow & \downarrow &&&& \downarrow &\swarrow \\ && p &&&& q\\ &&& \searrow && \swarrow\\ &&&& x\end{array}$$

From this we can see that:

$$\begin{align}\mathcal L(\alpha,\beta,\gamma,\delta\mid x) ~&=~ f(x\mid \alpha,\beta,\gamma,\delta) \tag 1 \\ &=~ \iint f(x, p,q\mid \alpha,\beta,\gamma,\delta)\operatorname d (p,q) \tag 2 \\ &=~ \iint f(x \mid p,q, \alpha,\beta,\gamma,\delta)\,f(p,q\mid \alpha,\beta,\gamma,\delta) \operatorname d (p,q) \tag 3 \\ &=~ \iint f(x\mid p,q)\,f(p,q\mid \alpha,\beta,\gamma,\delta)\operatorname d (p,q) \tag 4 \\ &=~ \iint f(x\mid p,q)\,f(p\mid \alpha,\beta)\,f(q\mid \gamma,\delta)\operatorname d (p,q) \tag 5 \\ &=~ \iint \mathcal L(p,q\mid x)\,f(p\mid \alpha,\beta)\,f(q\mid \gamma,\delta)\operatorname d (p,q) \tag 6\end{align}$$

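(1, 6) by the definition of a likelihood function: for fixed $x$, it is the density of the data viewed as a function of whatever is being conditioned on.

(2) by the Law of Total Probability, marginalizing the joint density over $p$ and $q$.

(3) by the definition of conditional density (the product rule).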

(4) The variable $x$ and the hyperparameters $\{\alpha,\beta,\gamma,\delta\}$ are conditionally independent given the parameters $\{p,q\}$, so $f(x\mid p,q,\alpha,\beta,\gamma,\delta)=f(x\mid p,q)$.

(5) The subgraph formed by nodes $\{\alpha, \beta, p\}$ is independent of that formed by nodes $\{\gamma,\delta, q\}$, so $f(p,q\mid \alpha,\beta,\gamma,\delta)=f(p\mid \alpha,\beta)\,f(q\mid \gamma,\delta)$.
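
This also covers the "expectation" remark in the paper: line (6) is exactly $\mathbb E\big[\mathcal L(P,Q\mid x)\big]$ with $P\sim f(p\mid\alpha,\beta)$ and $Q\sim f(q\mid\gamma,\delta)$ drawn independently, so marginalizing out $p,q$ and taking the expectation over their priors are the same operation. As a rough numerical check, here is a minimal Python sketch. Everything concrete in it is my own assumption and not something from the question: Beta priors, data $x=(k_1,n_1,k_2,n_2)$ made of two independent binomial counts, and the helper names `likelihood`, `marginal_mc`, `beta_binom` and `marginal_exact`. Under those assumptions the double integral has a closed form (a product of two beta-binomial probabilities), which the Monte Carlo average of the likelihood over prior draws should approximate.

```python
import math
import random


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binom_pmf(k, n, t):
    """Binomial probability of k successes in n trials with success rate t."""
    return math.comb(n, k) * t**k * (1.0 - t) ** (n - k)


def likelihood(p, q, x):
    """L(p, q | x) = f(x | p, q) for the ASSUMED model:
    x = (k1, n1, k2, n2), k1 ~ Bin(n1, p), k2 ~ Bin(n2, q), independent."""
    k1, n1, k2, n2 = x
    return binom_pmf(k1, n1, p) * binom_pmf(k2, n2, q)


def marginal_mc(x, alpha, beta, gamma, delta, draws=100_000, seed=0):
    """Monte Carlo estimate of E[L(P, Q | x)], with ASSUMED priors
    P ~ Beta(alpha, beta) and Q ~ Beta(gamma, delta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p = rng.betavariate(alpha, beta)
        q = rng.betavariate(gamma, delta)
        total += likelihood(p, q, x)
    return total / draws


def beta_binom(k, n, a, b):
    """Closed form of the 1-D integral of Bin(k; n, t) * Beta(t; a, b) dt."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def marginal_exact(x, alpha, beta, gamma, delta):
    """Closed-form L(alpha, beta, gamma, delta | x) under the assumed model.
    The double integral factorizes because both the likelihood and the
    prior factorize in p and q."""
    k1, n1, k2, n2 = x
    return beta_binom(k1, n1, alpha, beta) * beta_binom(k2, n2, gamma, delta)


if __name__ == "__main__":
    x = (3, 10, 7, 10)  # hypothetical data
    print("Monte Carlo:", marginal_mc(x, 2.0, 3.0, 4.0, 2.0))
    print("Closed form:", marginal_exact(x, 2.0, 3.0, 4.0, 2.0))
```

Note that $x$ stays fixed throughout: only $p$ and $q$ are averaged over, and what remains is a function of $\alpha,\beta,\gamma,\delta$ alone.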