Bayesian Linear Regression


Currently I am trying to understand Bayesian linear regression, and there are several things I don't understand.

First of all, we have $$ p(\beta,\sigma^2|y,X) = \frac{p(y|\beta, \sigma^2, X)p(\beta, \sigma^2)}{p(y)}. $$ I assume here the model $y = X\beta + \epsilon$, where $y$, $\beta$, and $\epsilon$ are vectors, $X$ is a matrix, and the components of $\epsilon$ are drawn independently from a Gaussian with zero mean and variance $\sigma^2$.

We can then further split the joint prior up into $p(\beta, \sigma^2)=p(\beta|\sigma^2)\,p(\sigma^2)$.
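As a concrete sketch of this model (the dimensions and parameter values below are illustrative assumptions, not taken from the question), the data-generating process $y = X\beta + \epsilon$ can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 2                       # illustrative sizes: 50 observations, 2 predictors
beta_true = np.array([1.0, -2.0])  # assumed "true" coefficient vector
sigma2_true = 0.5                  # assumed "true" noise variance

X = rng.normal(size=(n, p))                          # fixed design matrix
eps = rng.normal(0.0, np.sqrt(sigma2_true), size=n)  # iid N(0, sigma^2) noise
y = X @ beta_true + eps                              # y = X beta + eps
```

In a Bayesian treatment, both `beta_true` and `sigma2_true` are unknown, and the goal is the posterior over them given `y` and `X`.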

My questions are the following:

1.) Why is the posterior a joint posterior over $\beta$ and $\sigma^2$? I would assume that it is only a posterior over $\beta$.

2.) Assuming that the posterior is joint, why is the likelihood not a joint distribution over $y$ and $X$? If it is only a likelihood over $y$, the calculus in Bayes' theorem doesn't seem to add up.

3.) Why is the (conjugate) prior over the variance, $p(\sigma^2)$, an inverse Gamma distribution? I would assume that both priors are Gaussian, since the likelihood is Gaussian (and the Gaussian is self-conjugate).

Best Answer

1.) Why is the posterior a joint posterior over $\beta$ and $\sigma^2$? I would assume that it is only a posterior over $\beta$.

In Bayesian inference, you need to assign a prior to all unknown parameters. Here you have two unknown parameters, $\beta$ and $\sigma^2$. The posterior you obtain is therefore over both parameters, because you are updating the prior over all unknowns.
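To make "a posterior over both unknowns" concrete, here is a minimal grid approximation of $p(\beta, \sigma^2 \mid y, X)$ for a one-predictor model with a flat prior (the data, grid ranges, and flat-prior choice are all illustrative assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=n)
y = 2.0 * X + rng.normal(0.0, 0.7, size=n)  # assumed true beta = 2, sigma = 0.7

# Grids for the TWO unknowns: the posterior is a surface over (beta, sigma^2)
betas = np.linspace(0.0, 4.0, 200)
sigma2s = np.linspace(0.1, 2.0, 200)
B, S2 = np.meshgrid(betas, sigma2s, indexing="ij")

# log p(y | beta, sigma^2, X): Gaussian likelihood evaluated at each grid point
resid2 = ((y[None, None, :] - B[..., None] * X[None, None, :]) ** 2).sum(-1)
loglik = -0.5 * n * np.log(2 * np.pi * S2) - resid2 / (2 * S2)

# Flat prior for illustration, so posterior ∝ likelihood; normalise over the grid
logpost = loglik - loglik.max()
post = np.exp(logpost)
post /= post.sum()

# The joint MAP picks out a (beta, sigma^2) pair, not just a beta
i, j = np.unravel_index(np.argmax(post), post.shape)
beta_map, sigma2_map = betas[i], sigma2s[j]
```

The normalised array `post` is exactly the joint posterior the question asks about: a distribution over pairs $(\beta, \sigma^2)$, from which marginals over either parameter can be read off by summing out the other.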

2.) Assuming that the posterior is joint, why is the likelihood not a joint distribution over $y$ and $X$? If it is only a likelihood over $y$, the calculus in Bayes' theorem doesn't seem to add up.

The matrix $X$ is usually assumed known (which I take to be the assumption in your case), and thus everything is conditioned on it; it is not random, so Bayes' theorem goes through. In your case, as is common, the prior is independent of $X$, i.e. $p(\beta, \sigma^2 \mid X) = p(\beta, \sigma^2)$.

3.) Why is the (conjugate) prior over the variance, $p(\sigma^2)$, an inverse Gamma distribution? I would assume that both priors are Gaussian, since the likelihood is Gaussian (and the Gaussian is self-conjugate).

Because the parameter $\sigma^2$ (the variance) is always positive, the prior PDF must have its support on the positive reals; it doesn't make much sense for your prior to assign positive probability to $\sigma^2$ being negative, which a Gaussian prior would. The reason for the specific choice of the inverse Gamma distribution as a prior in this case (as you indicated) is the conjugacy property, which leads to a tractable posterior.
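As a sketch of that conjugacy, take the simplified case where $\beta$ is assumed known, so only $\sigma^2$ is unknown. With an inverse-Gamma prior $\sigma^2 \sim \mathrm{InvGamma}(a_0, b_0)$ and a Gaussian likelihood, the posterior is again inverse-Gamma, $\mathrm{InvGamma}(a_0 + n/2,\; b_0 + \mathrm{RSS}/2)$. The data and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Known-beta case: y_i ~ N(x_i * beta, sigma^2)
n = 30
beta = 1.5
X = rng.normal(size=n)
y = beta * X + rng.normal(0.0, 1.0, size=n)

a0, b0 = 2.0, 1.0  # illustrative prior hyperparameters: sigma^2 ~ InvGamma(a0, b0)
rss = np.sum((y - beta * X) ** 2)  # residual sum of squares

# Conjugate update: posterior is InvGamma(a0 + n/2, b0 + rss/2)
a_n = a0 + n / 2
b_n = b0 + rss / 2
posterior = stats.invgamma(a_n, scale=b_n)  # posterior mean = b_n / (a_n - 1)
```

With an additional Gaussian prior $p(\beta \mid \sigma^2)$ on the coefficients, this extends to the full Normal-inverse-Gamma family, which is conjugate for $(\beta, \sigma^2)$ jointly.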