In most textbooks (e.g. Rasmussen and Williams's *Gaussian Processes for Machine Learning*) the mean of a Gaussian process is set to zero. Of course, this does not mean that all estimated values are expected to be zero: we are looking for the maximum a posteriori (MAP) estimate of these variables, and the posterior no longer has zero mean.
Is there a mathematically rigorous proof, or at least a mathematically grounded explanation, that the zero-mean assumption is not too restrictive for the a posteriori estimate?
To make things a bit clearer, assume that we have the following model, where the noise $e$ is uncorrelated with $f(x)$:
$y= f(x) +e $,
$f(x) \sim \mathcal{N} (m, K)$,
$e \sim \mathcal{N} (0, \sigma^2)$.
Then the a posteriori estimate (which, since everything is Gaussian, coincides with the MAP estimate) is given by
$\mathbb{E} (f | y) = m + K (\sigma^2 I + K)^{-1} (y-m)$
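For a finite set of inputs this formula can be checked numerically. The sketch below (with an assumed squared-exponential kernel and made-up data) shows that conditioning with an explicit prior mean $m$ gives exactly the same posterior mean as centring the data, running a zero-mean GP on $y - m$, and adding $m$ back afterwards:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def posterior_mean(m, K, sigma2, y):
    # E(f | y) = m + K (sigma^2 I + K)^{-1} (y - m)
    n = len(y)
    return m + K @ np.linalg.solve(sigma2 * np.eye(n) + K, y - m)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)
K = rbf_kernel(x, x)
m = 3.0 * np.ones_like(x)   # nonzero prior mean (constant, chosen for illustration)
sigma2 = 0.1
y = rng.multivariate_normal(m, K + sigma2 * np.eye(5))  # draw from the model

# Direct computation with the explicit prior mean m ...
direct = posterior_mean(m, K, sigma2, y)
# ... equals centring the data, using a zero-mean GP, and adding m back.
centred = m + posterior_mean(np.zeros_like(m), K, sigma2, y - m)
print(np.allclose(direct, centred))  # True
```

So when the mean is known, the zero-mean assumption costs nothing: one can always work with the centred data $y - m$. The real question, as below, is what happens when $m$ is unknown.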
The way I see it, setting the mean to zero usually reduces the number of hyperparameters to be estimated. Most of the time the mean is also unknown, so it has to be parametrized by some hyperparameters, and these then have to be estimated by a non-convex optimization routine, increasing the computational burden. However, this does not explain mathematically whether the assumption of a zero mean restricts the a posteriori estimate too much.
Any help would be appreciated!
I think the mean is set to zero in most textbooks to make the conditional expectations easier to read and understand. However, there is a section in Rasmussen and Williams's GPML that discusses the use of a mean function in GP regression (see Section 2.7).
In practice, I would think most people impose a linear mean function, i.e. $m(x) = w^T x$. This allows one to think of the Gaussian process as a two-stage model: first characterize the mean, then account for the covariance (or correlation) around that mean, just as in kriging. The benefit of having a mean is more control over your modelling process, rather than treating everything as a black box. Having a mean function can also help prediction in regions where you are low on data.
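The two-stage view can be sketched as follows (all data and hyperparameter values here are made up for illustration): fit the linear mean $w^T x$ by ordinary least squares, then put a zero-mean GP on the residuals and add the two predictions together.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.3):
    # Squared-exponential kernel on scalar inputs.
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 30))
# Toy data: linear trend plus a smooth wiggle plus noise.
y = 2.0 + 1.5 * x + 0.1 * np.sin(10 * x) + 0.05 * rng.standard_normal(30)

# Stage 1: estimate the mean function w^T x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])       # intercept + slope
w, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w

# Stage 2: zero-mean GP on the residuals around that mean.
sigma2 = 0.05**2
K = rbf_kernel(x, x)
alpha = np.linalg.solve(K + sigma2 * np.eye(len(x)), resid)

# Prediction at new inputs = linear mean + GP correction.
x_new = np.linspace(0.0, 1.0, 50)
X_new = np.column_stack([np.ones_like(x_new), x_new])
pred = X_new @ w + rbf_kernel(x_new, x) @ alpha
```

Note that far from the data the GP correction decays to zero, so predictions fall back on the linear mean rather than on zero, which is exactly the extrapolation benefit mentioned above.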
As for the question "Can a GP learn any function?": that is a tough one. Some kernels, like the so-called squared-exponential kernel over $\mathbb{R}^M$, induce function classes that lie dense in the space of continuous functions. But if your function is not well represented by the kernel (i.e. has low prior density), the number of data points required to achieve a given error can be exponential, giving no practical guarantees at all.