Why is the mean value of a Gaussian process usually set to zero?


In most textbooks (e.g. Rasmussen and Williams' Gaussian Processes for Machine Learning) the mean of a Gaussian process is set to zero. Of course, this does not mean that all the values are expected to be zero, since we are looking for the maximum a posteriori (MAP) estimate of these variables, which no longer has a zero mean.

Is there a mathematically robust proof, or at least a mathematical argument, that the zero-mean assumption is not too restrictive for the a posteriori estimate?

To make things a bit more clear, assume that we have the following model where the noise $e$ is uncorrelated with $f(x)$:

$y= f(x) +e $,

$f(x) \sim \mathcal{N} (m, K)$,

$e \sim \mathcal{N} (0, \sigma^2)$.

Then the posterior mean (which, since everything is Gaussian, coincides with the MAP estimate) is given by

$\mathbb{E} (f | y) = m + K (\sigma^2 I + K)^{-1} (y-m)$
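As a quick numerical sanity check (a minimal sketch with an assumed toy kernel and synthetic data, not part of the question), the posterior mean above can be computed directly. It also shows that centering the data, running zero-mean GPR, and adding the mean back reproduces the known-mean posterior, whereas simply ignoring a nonzero mean does not:

```python
import numpy as np

# Toy setup (assumed values): y = f + e, f ~ N(m, K), e ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)

# Squared-exponential covariance with an assumed length scale of 0.3.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2)
sigma2 = 0.1
m = 2.0 * np.ones_like(x)            # constant prior mean, for illustration
y = m + rng.standard_normal(5)       # synthetic observations

A = sigma2 * np.eye(5) + K

# E(f | y) = m + K (sigma^2 I + K)^{-1} (y - m), as in the question:
post_m = m + K @ np.linalg.solve(A, y - m)

# Equivalent: center the data, run zero-mean GPR, add the mean back:
post_centered = K @ np.linalg.solve(A, y - m) + m

# Not equivalent: pretend the mean is zero on the raw data:
post_zero = K @ np.linalg.solve(A, y)

assert np.allclose(post_m, post_centered)
assert not np.allclose(post_m, post_zero)
```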

The way I see it, setting the mean to zero usually reduces the number of hyperparameters to be estimated: most of the time the mean is also unknown, so it has to be parametrized by some hyperparameters, which then have to be estimated by a non-convex optimization routine, increasing the computational burden. However, this does not explain mathematically whether the zero-mean assumption restricts the a posteriori estimate too much.

Any help would be appreciated!


There are 2 answers below.


I think the mean is set to zero in most textbooks to make the conditional expectations easier to read and understand. However, there is a section in Rasmussen and Williams' GPML that discusses the use of a mean function in GP regression (see Section 2.7).

In practice I would think most people impose a linear mean function, i.e. $w^T x$. This allows one to think of the Gaussian process as a two-stage model: first characterize the mean, then account for the covariance (or correlation) around the mean, just like kriging. The benefit of having a mean is more control over your modelling process, rather than treating everything like a black box. A mean function can also help prediction in regions where you have little data.
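A minimal sketch of that two-stage view (all data and hyperparameters below are made up for illustration): fit the linear mean $w^T x$ by least squares, then run zero-mean GP regression on the residuals, so predictions are mean plus GP correction:

```python
import numpy as np

# Two-stage sketch: linear mean first, zero-mean GP on the residuals.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 20)
y = 1.5 * x + np.sin(3.0 * x) + 0.05 * rng.standard_normal(20)

# Stage 1: linear mean w^T x (with an intercept) by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w

# Stage 2: zero-mean GP regression on the residuals.
def se_kernel(a, b, ell=0.5):
    # Squared-exponential kernel; the length scale is an assumed hyperparameter.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

sigma2 = 0.05**2
alpha = np.linalg.solve(se_kernel(x, x) + sigma2 * np.eye(len(x)), resid)

def predict(x_new):
    # Linear mean plus the GP correction around it.
    mean = np.column_stack([np.ones_like(x_new), x_new]) @ w
    return mean + se_kernel(x_new, x) @ alpha
```

Far from the data the GP correction vanishes, so the prediction falls back to the fitted linear mean rather than to zero, which is exactly the extrapolation benefit mentioned above.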

As for the question "Can a GP learn any function?": that is a tough one. Some kernels, like the so-called squared-exponential kernel over $\mathbb{R}^M$, induce function spaces that lie dense in the space of continuous functions, but if your function is not well represented by the kernel (i.e. has low prior density), the number of data points required to achieve a given error can be exponential, giving no guarantees at all.


To cite Chiles and Delfiner in Geostatistics: Modeling Spatial Uncertainty (2012), page 151, on the derivation of simple kriging (which is essentially GPR under a Gaussian assumption):

The estimator $Z^*$ becomes

$Z^*=m_0+\sum_\alpha\lambda_\alpha(Z_\alpha-m_\alpha)$

This amounts to estimating the zero-mean variable $Y(x)=Z(x)-m(x)$ by the linear estimator

$Y^*=\sum_\alpha\lambda_\alpha Y_\alpha$

and adding the mean afterwards. Thus we have established that the case of a known mean is equivalent to the case of a zero mean with $\lambda_0=0$, and from now on in this section we will consider that $Z(x)$ has a zero mean.

In other words: it makes no difference whether you use the mean of the data or assume a zero mean with normalized data and add the mean back afterwards. Rasmussen and Williams say on page 13 of Gaussian Processes for Machine Learning (2006) that the mean is set to zero only for simplicity, but it does not necessarily have to be zero. Usually, when you assume a known mean (as in SK, or in GPR when the mean is not individually modeled), you have two options:

  1. Assume the mean actually is zero and just perform GPR.
  2. Normalize your observed data (so the new mean really is zero), solve your GPR interpolation on the normalized data, and add the mean back afterwards (exactly as described in Chiles and Delfiner (2012)).
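The contrast between the two options can be seen in a small sketch (toy 1-D data and assumed hyperparameters): interpolation inside the data range is nearly identical, while far from the data option 1 decays to zero and option 2 returns to the sample mean:

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    # Squared-exponential kernel with an assumed length scale.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

x = np.linspace(0.0, 5.0, 8)
y = 3.0 + np.sin(x)                  # data fluctuating around a mean near 3
sigma2 = 1e-4
A = se_kernel(x, x) + sigma2 * np.eye(len(x))

def predict_option1(x_new):
    # Option 1: assume the mean really is zero.
    return se_kernel(x_new, x) @ np.linalg.solve(A, y)

mbar = y.mean()

def predict_option2(x_new):
    # Option 2: normalize (subtract the mean), zero-mean GPR, add it back.
    return mbar + se_kernel(x_new, x) @ np.linalg.solve(A, y - mbar)

x_in = np.array([2.5])               # inside the data range
x_out = np.array([30.0])             # many length scales beyond the data
```

At `x_in` the two predictions agree closely; at `x_out` the kernel weights are essentially zero, so option 1 predicts about 0 while option 2 predicts about `mbar`.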

Both options give very similar interpolations and variance predictions inside the data range, but they differ in extrapolation. The reason is that the extrapolated value tends towards the assumed mean beyond a distance proportional to the length-scale parameter. For option 1 this limit is exactly zero. For option 2 the zero-mean prediction also decays to zero, but because the mean is added back, the extrapolation returns to the mean of the original (un-normalized) data.