How to understand the posterior hyperparameters for a Bernoulli likelihood with a Beta conjugate prior?


From here: https://en.wikipedia.org/wiki/Conjugate_prior#When_the_likelihood_function_is_a_discrete_distribution

I know $\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}$, but how does that formula yield the posterior hyperparameters $\alpha +\sum _{i=1}^{n}x_{i},\ \beta +n-\sum _{i=1}^{n}x_{i}$, and what does the sum of the $x_{i}$ mean? The Bernoulli distribution is:

$$ p(x \mid \mu) = \mu^x(1-\mu)^{1-x}, \quad x \in \{0,1\}. $$

Can someone give me some intuition about the posterior hyperparameters? E.g. explain how those terms fall out of $\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}$.


There are 2 answers below.

BEST ANSWER

As the link explains and I quote:

Let $n$ denote the number of observations. In all cases below, the data is assumed to consist of $n$ points $x_{1}, \dots, x_{n}$

One important assumption the link leaves implicit is that the $x_{i}$ are independent observations, although, as far as I have read, this is the norm anyway.

To be explicit, each $x_{i}$ is an observation of a Bernoulli experiment with parameter $\mu$; $$p(x_{i} \mid \mu) = \mu^{x_{i}}(1-\mu)^{1-x_{i}}.$$

Let me denote $\mathcal{D} = \{x_{1}, \dots, x_{n}\}$. To get the hyperparameters for the posterior it is easiest to use the unnormalized Bayes theorem:

$$p(\mu\mid \mathcal{D})\propto p(\mathcal{D}\mid\mu)\,p(\mu).$$ Since the data in $\mathcal{D}$ are (implicitly) assumed to be independent: $$p(\mathcal{D}\mid \mu) = \prod_{i=1}^{n}p(x_{i}\mid \mu) = \prod_{i=1}^{n} \mu^{x_{i}}(1-\mu)^{1-x_{i}} = \mu^{\sum_{i=1}^{n}x_{i}}(1-\mu)^{n - \sum_{i=1}^{n}x_{i}}.$$ Since $\mu\sim\text{Beta}(\alpha,\beta)$ and hence $p(\mu) \propto \mu^{\alpha-1}(1-\mu)^{\beta-1}$, this implies that the posterior is:
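A quick numerical sanity check of this factorization, using a made-up value of $\mu$ and a small made-up data set:

```python
# Check that the product of Bernoulli pmfs equals mu^s * (1 - mu)^(n - s),
# where s = sum of the x_i. Values of mu and data are arbitrary.
mu = 0.3
data = [1, 0, 1, 1, 0]

prod = 1.0
for x in data:
    prod *= mu**x * (1 - mu)**(1 - x)   # p(x_i | mu), term by term

s, n = sum(data), len(data)             # s counts the ones among the x_i
closed_form = mu**s * (1 - mu)**(n - s)

print(abs(prod - closed_form) < 1e-12)  # True
```

This makes concrete why only $\sum_i x_i$ (the number of ones) and $n$ survive into the posterior: the product collapses to a function of those two numbers alone.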

$$p(\mu\mid \mathcal{D})\propto \mu^{\sum_{i=1}^{n}x_{i} + \alpha - 1}(1-\mu)^{n - \sum_{i=1}^{n}x_{i} + \beta - 1}.$$

To find the normalizing constant you could integrate, but it is simpler to note, by comparing exponents, that $p(\mu\mid \mathcal{D})$ is proportional to the density of a $\text{Beta}(\alpha + \sum_{i=1}^{n} x_{i},\ \beta + n - \sum_{i=1}^{n} x_{i})$ distribution, and hence the constant is the one from that Beta distribution.
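In code, the whole conjugate update reduces to two additions. A minimal sketch, with made-up prior hyperparameters and data:

```python
# Beta-Bernoulli conjugate update (hypothetical prior and data).
alpha, beta = 2.0, 2.0           # prior: Beta(2, 2)
data = [1, 0, 1, 1, 0, 1, 1]     # n = 7 Bernoulli observations x_i

n = len(data)
s = sum(data)                    # number of successes, sum of x_i

alpha_post = alpha + s           # alpha + sum x_i
beta_post = beta + n - s         # beta + n - sum x_i

print(alpha_post, beta_post)     # 7.0 4.0
# Posterior mean of mu under Beta(a, b) is a / (a + b):
print(alpha_post / (alpha_post + beta_post))  # 7/11, about 0.636
```

Intuitively, $\alpha$ acts like a count of previously seen ones and $\beta$ like a count of previously seen zeros; the data just adds its own counts to each.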

ANOTHER ANSWER

Your "evidence" in the denominator is simply a scaling factor so the whole expression integrates over $\mu$ to $1$, as it needs to for a probability. We are interested in the shape of the distribution for $\mu$.

The prior for $\mu$ is a Beta distribution with density proportional to $\mu^{\alpha-1} (1-\mu)^{\beta-1}$ for $0 \le \mu \le 1$.

The likelihood from the observations $x_1,x_2,\ldots,x_n$, each in $\{0,1\}$, is proportional to $\mu^{x_1}(1-\mu)^{1-{x_1}}\mu^{x_2}(1-\mu)^{1-{x_2}}\cdots \mu^{x_n}(1-\mu)^{1-{x_n}} = \mu^{\sum x_i}(1-\mu)^{n-\sum x_i}$.

So the product of the prior for $\mu$ and the likelihood is proportional to $\mu^{\alpha+\sum x_i-1}(1-\mu)^{\beta+n-\sum x_i-1}$. I.e. the posterior for $\mu$ is again a Beta distribution, now with parameters $\alpha+\sum x_i$ and $\beta+n-\sum x_i$.
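To see numerically that prior times likelihood really has a Beta shape, one can normalize the product on a grid and compare it to the $\text{Beta}(\alpha+\sum x_i,\ \beta+n-\sum x_i)$ density. A sketch with hypothetical hyperparameters and data, using `math.gamma` for the Beta normalizing constant:

```python
import math

# Hypothetical prior hyperparameters and data.
alpha, beta = 3.0, 2.0
data = [1, 1, 0, 1]
s, n = sum(data), len(data)

def beta_pdf(mu, a, b):
    """Beta(a, b) density, normalized via the Gamma function."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * mu**(a - 1) * (1 - mu)**(b - 1)

# Unnormalized posterior kernel: prior * likelihood, on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
kernel = [m**(alpha - 1 + s) * (1 - m)**(beta - 1 + n - s) for m in grid]
area = sum(kernel) / 1000                  # crude Riemann normalization
posterior = [k / area for k in kernel]

target = [beta_pdf(m, alpha + s, beta + n - s) for m in grid]
print(max(abs(p - t) for p, t in zip(posterior, target)) < 1e-2)  # True
```

The grid normalization plays the role of the "evidence" denominator: it only rescales the curve, which is why the shape, and hence the hyperparameters, can be read off from the unnormalized product.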