I would like to use Gaussian processes (GPs) for Bayesian classification of medical data. I think I understand the basics, but I have some uncertainties that are perhaps partly related to notation. It is known that in GP regression (assuming a Gaussian likelihood) predictions at unseen inputs can be obtained from closed-form formulas derived by conditioning a jointly Gaussian multivariate distribution. However, to gain more insight into the connection between GPs and the Bayesian framework, I would like to see the derivation (or at least understand the idea behind it) using Bayes' rule. This interests me because it would help me understand what happens when the likelihood is non-Gaussian and an elegant closed-form solution is no longer possible. I tried to find answers to my questions on the Internet, but to no avail.
The notation I use below is taken from this online lecture on GPs: http://videolectures.net/mlss2012_cunningham_gaussian_processes/?q=gaussian%20processed
It is assumed that the observed data values y are defined in terms of a noise-free function f with added Gaussian noise $\varepsilon$.
$$ y=f+\varepsilon $$
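For reference, this is the closed-form case I mean, as a minimal numpy sketch (RBF kernel and 1-D toy data of my own choosing): the standard conditioning formulas $\bar{f}^{*} = K_{*}(K + \sigma^{2} I)^{-1} y$ and $\operatorname{cov}(f^{*}) = K_{**} - K_{*}(K + \sigma^{2} I)^{-1} K_{*}^{\top}$.

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    # Squared-exponential kernel k(a, b) = sf^2 * exp(-(a - b)^2 / (2 * ell^2))
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(x, y, xs, noise=0.1):
    # Closed-form GP regression: condition the joint Gaussian of (f, f*)
    K = rbf(x, x) + noise**2 * np.eye(len(x))   # K(X, X) + sigma^2 I
    Ks = rbf(xs, x)                             # K(X*, X)
    Kss = rbf(xs, xs)                           # K(X*, X*)
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha                           # posterior mean at X*
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # posterior covariance at X*
    return mean, cov

# Toy data: noisy-free sine values at five inputs
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.sin(x)
xs = np.linspace(-3.0, 3.0, 7)
mean, cov = gp_predict(x, y, xs)
```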
I would solve the problem first by calculating the posterior distribution (1).
[1] Data posterior: $$ p(f \mid y) = \frac{p(y \mid f)\,p(f)}{p(y)} $$
However, in order to get the posterior (1), one needs to calculate the marginal likelihood, which is generally difficult.
[2] Marginal likelihood:
$$ p(y) = \int p(y \mid f)\,p(f)\,df $$
Finally, once one obtains the posterior distribution, it is possible to get predictions by calculating other nasty integrals (3, 4).
[3] Predictive posterior: $$ p(f^{*} \mid y) = \int p(f^{*} \mid f)\,p(f \mid y)\,df $$
[4] Predictive distribution: $$ p(y^{*} \mid y) = \int p(y^{*} \mid f^{*})\,p(f^{*} \mid y)\,df^{*} $$
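For what it's worth, I do see that in the Gaussian-likelihood case the integral [2] collapses to the closed form $\log p(y) = -\tfrac{1}{2} y^{\top} K_y^{-1} y - \tfrac{1}{2}\log\lvert K_y\rvert - \tfrac{n}{2}\log 2\pi$ with $K_y = K + \sigma^{2} I$; here is my minimal numpy sketch of it (RBF kernel, hyperparameters of my own choosing):

```python
import numpy as np

def log_marginal_likelihood(x, y, ell=1.0, sf=1.0, noise=0.1):
    # For Gaussian noise, integral [2] has the closed form
    #   log p(y) = -1/2 y^T Ky^-1 y - 1/2 log|Ky| - n/2 log(2*pi),
    # with Ky = K(X, X) + noise^2 I -- no function-space integral needed.
    n = len(x)
    d = x[:, None] - x[None, :]
    Ky = sf**2 * np.exp(-0.5 * (d / ell) ** 2) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^-1 y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # -1/2 log|Ky|
            - 0.5 * n * np.log(2.0 * np.pi))
```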
When I tried to do these calculations, I ran into problems that generated many questions:
- First,
I don't know how to carry out the integrations over the space of functions (equations 2-4). Specifically, the marginalization with respect to the latent function f in the marginal likelihood [2], where p(f) is a Gaussian process prior, makes little sense to me.
- Second,
It is known that for a Gaussian likelihood (which is the case here), the Gaussian process prior is conjugate, so the posterior is again a Gaussian process. This means the posterior distribution should integrate to 1 if it is a valid probability distribution. I don't know what it means for a Gaussian process to integrate to 1, since a Gaussian process is a distribution over functions, defined via a mean function and a covariance function (kernel). It is hard for me to see how a GP relates to a probability density function (pdf) over functions.
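My current (possibly wrong) understanding of the "integrates to 1" issue is that a GP is defined through its finite-dimensional marginals: the function values at any finite set of inputs follow an ordinary multivariate Gaussian, and it is those densities that integrate to 1. A quick numerical check for a single input point (RBF kernel with unit signal variance, my own toy setup):

```python
import numpy as np

# A GP prior evaluated at any finite set of inputs is an ordinary
# multivariate Gaussian; "integrates to 1" refers to these
# finite-dimensional marginal densities, not to a single integral
# over the whole function space.
var = 1.0                                 # k(x, x) for an RBF kernel with sf = 1
grid = np.linspace(-8.0, 8.0, 10001)      # grid for the 1-point marginal f(x)
pdf = np.exp(-0.5 * grid**2 / var) / np.sqrt(2.0 * np.pi * var)
total = np.sum(pdf) * (grid[1] - grid[0])  # crude Riemann sum
print(total)                               # close to 1.0
```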
- Third,
if we want to do classification, we need to replace the Gaussian likelihood with some function that maps any value given by the prior to the interval (0, 1). The posterior is not a Gaussian process in this case, but one should expect it to remain some infinite-dimensional distribution, since it results from the product of the likelihood and the GP prior (which is itself infinite-dimensional). If that is true, does such a distribution still have a positive semidefinite kernel (and if not, can it be a GP at all)?
- Fourth,
What is the predictive posterior [3] good for? I consider only the predictive distribution [4] useful, since it gives us predictions for new values y*. However, [4] is itself sometimes called the predictive posterior or posterior predictive, which is a bit confusing.
- Last,
The posterior must be given by the product of the prior and the likelihood function. However, I often read that in GP classification the GP prior is squeezed into (0, 1) (using, e.g., a logistic or probit function) rather than multiplied by a sigmoid likelihood function. I am confused by this.
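To illustrate what I mean by "squeezed": as I understand it, one draws latent functions from the GP prior and pushes them through a sigmoid, so the resulting class probabilities lie in (0, 1). A numpy sketch of that picture (RBF kernel and input grid of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
d = x[:, None] - x[None, :]
K = np.exp(-0.5 * d**2) + 1e-8 * np.eye(len(x))  # RBF kernel + jitter

# Draw latent functions f ~ GP(0, K) at the grid points ...
f = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# ... and squash them through the logistic sigmoid to get class probabilities.
p = 1.0 / (1.0 + np.exp(-f))
print(p.min(), p.max())  # all values lie strictly between 0 and 1
```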
Any help will be appreciated. Thanks a lot.