I am trying to read this paper, and I seem to lack some of the assumed mathematical background. I am trying to understand a certain section (the beginning of 3.1), which I will reproduce here.
If anybody can just explain in greater detail what's going on, I would appreciate it.
The basic idea of the paper is to create a graph-theoretic version of the Matérn kernel for a Gaussian process on a (weighted) graph. (Normally the Matérn kernel is used for a Gaussian process on Euclidean space.)
Note that $\Delta$ is the graph Laplacian.
Graph Fourier feature methods approximate the kernel matrix using a truncated eigenvalue expansion. Here, we first obtain the $\ell$ smallest eigenvalues and eigenvectors of $\Delta$ using a routine designed for efficiently computing the eigenvalues of sparse matrices, such as the Lanczos algorithm. Since the mappings
\begin{align} \lambda \mapsto \left( \frac{2\nu}{\kappa^2} + \lambda\right)^{-\nu} \qquad \qquad \lambda \mapsto e^{-\frac{\kappa^2}{4}\lambda} \end{align}
are decreasing functions on $\mathbb R^+$, we thus obtain the $\ell$ largest eigenvalues of the Matérn or diffusion kernels. Once this is obtained, scalable training proceeds by relating the approximate GP with a Bayesian linear regression model, mirroring Euclidean Fourier features (Rahimi and Recht, 2008).
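To make the quoted procedure concrete for myself, here is a small sketch of what I *think* it describes. The graph (a toy path graph), and the values of $\ell$, $\nu$, $\kappa$ are my own illustrative choices, not anything from the paper; I'm using SciPy's `eigsh` (an ARPACK Lanczos-type routine, here in shift-invert mode so the smallest eigenvalues converge reliably):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

# Laplacian of a simple unweighted path graph on n nodes: Delta = D - A.
# (Toy graph of my own choosing, just to have a concrete sparse Laplacian.)
n = 50
rows = np.arange(n - 1)
A = csr_matrix((np.ones(n - 1), (rows, rows + 1)), shape=(n, n))
A = A + A.T
Delta = diags(np.asarray(A.sum(axis=1)).ravel()) - A

# Step 1: the ell smallest eigenpairs of Delta via a Lanczos-type routine.
# Shift-invert about a point just below 0 targets the eigenvalues nearest 0,
# i.e. the smallest ones of the (positive semi-definite) Laplacian.
ell, nu, kappa = 10, 3 / 2, 1.0   # illustrative hyperparameter values
lam, U = eigsh(Delta, k=ell, sigma=-1e-2)
order = np.argsort(lam)           # sort eigenvalues ascending, to be safe
lam, U = lam[order], U[:, order]

# Step 2: apply the (decreasing) Matern spectral transform, so the smallest
# Laplacian eigenvalues become the largest kernel eigenvalues.
phi = (2 * nu / kappa**2 + lam) ** (-nu)

# Step 3: rank-ell approximation of the kernel matrix, K ~= U diag(phi) U^T.
K_approx = (U * phi) @ U.T
```

If this sketch is right, `K_approx` is the truncated-expansion stand-in for the full kernel matrix that the rest of the section's scalable training would use.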
I know what eigenvalues and eigenvectors are. I have some idea of what a Gaussian process is. I am just looking for a fuller explanation of what's going on in the quoted section. To help focus my question, I will present my limited understanding of what's going on.
- The "kernel" is the covariance function of the Gaussian process, which is critical to many of the properties of the process, e.g. differentiability.
- Is it reasonable to think of the kernel as a matrix?
- If the kernel is too large to work with, we try to approximate it.
- So we use some off-the-shelf method, like the Lanczos method, to get the $\ell$ smallest eigenvalues and eigenvectors of $\Delta$.
- Why the $\ell$ smallest?
- How does extracting eigenvectors and eigenvalues help us approximate a matrix? Is it because we can form (part of) the eigendecomposition?
- Why do those two functions being decreasing mean that we get the $\ell$ largest eigenvalues of the Matérn kernel?
- Why are the largest eigenvalues of the Matérn kernel the most important?
Thank you.