Questions about Bayesian inference


From Wikipedia

  1. The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. $p(\theta \mid \alpha )$. ...

    The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf {X}\mid \theta )$ . This is also termed the likelihood,...

    The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $$p(\mathbf {X}\mid \alpha )=\int_{\theta }p(\mathbf {X}\mid \theta )p(\theta \mid \alpha )\operatorname {d}\!\theta .$$

    The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference: $$ p(\theta \mid \mathbf {X},\alpha )={\frac {p(\mathbf {X}\mid\theta )p(\theta \mid \alpha )}{p(\mathbf {X} \mid \alpha )}}\propto p(\mathbf {X} \mid \theta )p(\theta \mid \alpha ) $$

    In the calculation of the marginal likelihood and posterior distribution, I wonder why the likelihood is written $p(\mathbf {X}\mid \theta )$ and not $p(\mathbf {X} \mid \theta, \alpha )$?

  2. The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: $$ p(\tilde {x} \mid \mathbf {X},\alpha )=\int_{\theta}p(\tilde {x} \mid \theta )p(\theta \mid \mathbf {X},\alpha )\operatorname {d}\!\theta $$

    Why is $p(\tilde{x} \mid \theta )$ not $p(\tilde {x} \mid \theta, X, \alpha )$ instead?

Thanks!


There are 3 answers below.

---

The $\alpha$ are not random variables but fixed parameters of the assumed prior (hyperparameters). They are not modeled as uncertain quantities, so they carry no information about $\mathbf{X}$ beyond their role in the prior; this is why the likelihood of $\mathbf{X}$ is written $p(\mathbf{X}\mid\theta)$ without $\alpha$.

The same holds for $\mathbf{X}$ in the predictive distribution: given $\theta$, the data are fixed values. The posterior $p(\theta \mid {\mathbf {X}},\alpha )$ already incorporates the information the data carry about the probable values of $\theta$, so conditioning $p(\tilde{x}\mid\theta)$ on $\mathbf{X}$ as well would be redundant; the data are treated like fixed parameters in the predictive setting.
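As a numerical illustration (a minimal sketch with made-up data, not part of the original answer): in a beta-binomial model the hyperparameters enter Bayes' rule only through the prior factor, and a simple grid approximation of $p(\theta\mid\mathbf{X},\alpha)\propto p(\mathbf{X}\mid\theta)\,p(\theta\mid\alpha)$ recovers the known conjugate posterior mean $(a+k)/(a+b+n)$:

```python
import math

# Hypothetical data: k successes in n Bernoulli trials (numbers are made up)
n, k = 10, 7
a, b = 2.0, 2.0          # hyperparameters alpha = (a, b) of the Beta prior

def likelihood(theta):
    # p(X | theta): a function of theta and the data only; (a, b) never appears
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def prior(theta):
    # p(theta | alpha): Beta(a, b) density up to its normalizing constant
    return theta**(a - 1) * (1 - theta)**(b - 1)

# Grid approximation of the posterior p(theta | X, alpha)
grid = [(i + 0.5) / 1000 for i in range(1000)]
unnorm = [likelihood(t) * prior(t) for t in grid]
z = sum(unnorm)                      # proportional to the marginal likelihood p(X | alpha)
post = [u / z for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, post))
exact_mean = (a + k) / (a + b + n)   # mean of the conjugate Beta(a+k, b+n-k) posterior
print(round(post_mean, 4), round(exact_mean, 4))
```

The two printed values agree: $\alpha$ shapes the posterior only through the prior factor, never through the likelihood.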

---

For your first question

In the calculation of the marginal likelihood and posterior distribution, I wonder what is the reason that $p({\mathbf {X}}|\theta )$ is not $p({\mathbf {X}}|\theta, \alpha )$ instead?

$\alpha$ is a parameter of the probability density of $\theta$; i.e., it is a parameter of the prior. The likelihood $p({\mathbf{X}}|\theta )$ takes $\theta$ as a parameter, not $\alpha$.

A simple example should help. Consider a beta prior with parameters $(a,b)$ for a binomial success probability $\rho$. In this case, for a single observation, $p({\mathbf{X}}|\theta )$ is of the form $\binom{\cdot}{\cdot}\rho^\cdot(1-\rho)^\cdot$ and the prior is proportional to $\rho^{a-1}(1-\rho)^{b-1}$. Here $\alpha = (a,b)$ and $\theta = \rho$: the hyperparameters appear only in the prior, never in the likelihood.
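To make the separation concrete, here is a minimal sketch (the numbers are made up) showing that the likelihood is a function of $\rho$ and the data alone, while the hyperparameters $(a,b)$ move only the prior:

```python
import math

# Single Bernoulli observation x with success probability rho (so theta = rho)
def likelihood(x, rho):
    # p(x | rho) = rho^x (1 - rho)^(1 - x): no (a, b) anywhere
    return rho**x * (1 - rho)**(1 - x)

def beta_pdf(rho, a, b):
    # p(rho | alpha) with alpha = (a, b): the Beta(a, b) prior density
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * rho**(a - 1) * (1 - rho)**(b - 1)

rho = 0.3
# Changing the hyperparameters changes the prior density at rho ...
print(beta_pdf(rho, 1, 1), beta_pdf(rho, 5, 2))
# ... but the likelihood of an observation does not depend on them at all:
print(likelihood(1, rho))
```

Since $\alpha$ influences $\mathbf{X}$ only through $\theta$, conditioning on $\alpha$ in addition to $\theta$ changes nothing: $p(\mathbf{X}\mid\theta,\alpha)=p(\mathbf{X}\mid\theta)$.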

On the second question

Why is $p({\tilde {x}}|\theta )$ not $p({\tilde {x}}|\theta, X, \alpha )$ instead?

The details are in Eupraxis1981's answer. I would simply say that, given $\theta$, the data and the hyperparameters carry no additional information about $\tilde{x}$, so conditioning on them is redundant: they do not appear in $p({\tilde {x}}|\theta )$. A similar example could be constructed in this case also.

---

It is assumed that $\mathbf{X}$ is conditionally independent of $\alpha$ given $\theta$, and likewise that the new data point $\tilde{x}$ is conditionally independent of $\mathbf{X}$ (and of $\alpha$) given $\theta$. These assumptions are exactly what let $p(\mathbf{X}\mid\theta,\alpha)$ simplify to $p(\mathbf{X}\mid\theta)$ and $p(\tilde{x}\mid\theta,\mathbf{X},\alpha)$ to $p(\tilde{x}\mid\theta)$.
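These conditional-independence assumptions can be checked numerically (a sketch reusing a beta-binomial setup with made-up numbers): the factor inside the predictive integral is just $p(\tilde{x}\mid\theta)$, with no $\mathbf{X}$ or $\alpha$ argument, and the integral still reproduces the closed-form predictive probability $P(\tilde{x}=1\mid\mathbf{X},\alpha)=(a+k)/(a+b+n)$:

```python
# Hypothetical data: k successes in n trials; Beta(a, b) prior on theta
n, k = 10, 7
a, b = 2.0, 2.0

def post_density(theta):
    # Conjugate posterior p(theta | X, alpha) = Beta(a + k, b + n - k), unnormalized
    return theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)

grid = [(i + 0.5) / 2000 for i in range(2000)]
w = [post_density(t) for t in grid]
z = sum(w)

# Posterior predictive: integrate p(x_tilde = 1 | theta) * p(theta | X, alpha).
# Note p(x_tilde = 1 | theta) = theta -- it conditions on theta alone.
pred = sum(t * wi for t, wi in zip(grid, w)) / z

closed_form = (a + k) / (a + b + n)   # known predictive probability for this model
print(round(pred, 4), round(closed_form, 4))
```

Given $\theta$, the new point $\tilde{x}$ needs no further conditioning on $\mathbf{X}$ or $\alpha$: all of their influence has already been absorbed into the posterior over $\theta$.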