In Christopher M. Bishop's *Pattern Recognition and Machine Learning*, it is stated that when both the mean and the precision are unknown, the conjugate prior is a normal-gamma distribution.
Suppose our likelihood is as follows:
$p(y \mid \Phi, w, \beta) = \prod_{i=1}^N \mathcal{N}(y_i \mid w^T \phi(x_i), \beta^{-1})$
Then the conjugate prior over both $w$ and $\beta$ should be
$p(w, \beta) = \mathcal{N}(w \mid m_0, \beta^{-1}S_0)\,\mathrm{Gamma}(\beta \mid a_0, b_0)$
How can I show that the posterior distribution takes the same form as the prior? In other words how can I prove:
$p(w, \beta \mid \mathcal{D}) = \mathcal{N}(w \mid m_N, \beta^{-1}S_N)\,\mathrm{Gamma}(\beta \mid a_N, b_N)$
I believe giving explicit expressions for $m_N$, $S_N$, $a_N$, and $b_N$ would make this easier. Unfortunately, the normal and gamma distributions are still quite complex to me. Any help would be appreciated.
I don't think it's necessary to reproduce the derivation here, since it is a standard (if long) computation. You can find a complete derivation here:
https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf
If anything is unclear, I'm happy to explain!
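That said, once you have the result from that note, the updates are compact enough to code directly. Here is a minimal NumPy sketch of the normal-gamma posterior updates (the toy data, basis, and prior hyperparameters below are my own illustrative choices, not from the question):

```python
import numpy as np

def normal_gamma_posterior(Phi, y, m0, S0, a0, b0):
    """Conjugate update for Bayesian linear regression with unknown
    weights w and unknown noise precision beta.

    Prior:     p(w, beta)     = N(w | m0, S0/beta) * Gamma(beta | a0, b0)
    Posterior: p(w, beta | D) = N(w | mN, SN/beta) * Gamma(beta | aN, bN)
    """
    N = len(y)
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi           # posterior "precision" (scaled by beta)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y)     # posterior mean of w
    aN = a0 + N / 2.0                       # gamma shape update
    bN = b0 + 0.5 * (y @ y + m0 @ S0_inv @ m0 - mN @ SN_inv @ mN)  # gamma rate update
    return mN, SN, aN, bN

# toy example: y = 1 + 2x + noise, with a quadratic basis [1, x, x^2]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
Phi = np.vander(x, 3, increasing=True)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)

mN, SN, aN, bN = normal_gamma_posterior(
    Phi, y, m0=np.zeros(3), S0=np.eye(3), a0=1.0, b0=1.0)
```

With $m_0 = 0$ and $S_0 = I$, $m_N$ is just the ridge-regularized least-squares solution, which is a useful sanity check on the algebra.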
The model is, I believe, $y = m + \epsilon$, where $m$ is the mean function and $\epsilon$ is Gaussian error. The mean function is an unknown function of the inputs $x_i$, $i=1,\dots,n$, and some parameter $w$. In your particular setting, the mean function is a linear combination of basis functions $\phi$ of your data; as a simple example, with $\phi_i(x) = x^i$ you get $m(w,x) = w_1x + w_2x^2 + \dots$. You need to estimate the parameter $w$. This is equivalent to working with transformed data: say originally you have $x_{11},x_{12},\dots$, $x_{21},x_{22},\dots$; now you have $\phi(x_{11}),\phi(x_{12}),\dots$, $\phi(x_{21}),\phi(x_{22}),\dots$, and you estimate $w$ exactly as you would estimate the coefficients $b$ in linear regression. So when you do inference on $w$, you put a prior on it, in this case a normal prior, and you have a model for the mean $m(w,x)$. Recall that $\epsilon$ is Gaussian error; its variance (equivalently, the precision $\beta$) also needs to be estimated, which is why you use a normal-gamma prior (the normal-inverse-gamma prior is the same thing parameterized by the variance instead).
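To make the "transformed data" point concrete, here is a small sketch (the polynomial basis and the synthetic data are made up for illustration): build $\Phi$ by applying the basis functions to $x$, then estimate $w$ by ordinary least squares, exactly as one would estimate $b$ in linear regression:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
# true mean function m(w, x) = 0.5 + 1.5*x - 2.0*x^2, plus Gaussian error
y = 0.5 + 1.5 * x - 2.0 * x**2 + rng.normal(0, 0.2, size=100)

# basis expansion phi_i(x) = x^i: each row of Phi is [1, x, x^2]
Phi = np.vander(x, 3, increasing=True)

# estimating w on the transformed data is just linear regression
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

The nonlinearity lives entirely in the fixed basis $\phi$; the model stays linear in $w$, which is what makes the conjugate analysis above go through unchanged.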
In a standard problem you have $y_1,y_2,\dots$ from a Gaussian distribution with some mean and variance, and you want to estimate both. In your case, in addition to $y$ you also have predictors $x$, and the mean from the standard problem is now a linear combination of (basis functions of) the $x$; you need to estimate the weights $w$ that determine how they are combined. In other words, if you estimate the mean as in the standard problem and know that it follows the parametric form $w^T\phi(x)$, then you can solve for $w$.
If you have specific questions I might be able to provide better help.