Partial derivatives of the marginal likelihood of a Gaussian Process


In Chapter 5 of "Gaussian Processes for Machine Learning" by Rasmussen and Williams, on page 114 (p. 10 in the PDF), equation (5.9) gives the partial derivatives of the log marginal likelihood with respect to the hyperparameters:

$$ \frac{\partial \log p(\mathbf{y} \mid X, \theta)}{\partial \theta_j} = \frac{1}{2} \mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y} - \frac{1}{2} \text{tr} \left(K^{-1} \frac{\partial K}{\partial \theta_j} \right) \\ = \frac{1}{2} \text{tr} \left( (\alpha \alpha^\top - K^{-1})\frac{\partial K}{\partial \theta_j} \right) $$

With $\alpha = K^{-1}\mathbf{y}$.

How can you derive the second expression ($\frac{1}{2} \text{tr} \left( (\alpha \alpha^\top - K^{-1})\frac{\partial K}{\partial \theta_j} \right)$) from the first ($\frac{1}{2} \mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y} - \frac{1}{2} \text{tr} \left(K^{-1} \frac{\partial K}{\partial \theta_j} \right)$)?

I understand where the first expression comes from and how to obtain this derivative of the marginal likelihood; I just don't understand how they get the term with the $\alpha$'s. My linear algebra is a bit rusty. I tried to do the derivation myself, but I couldn't find a solution.

Ok, I think I have something:

Since the expression is a scalar, and the trace of a scalar (a $1 \times 1$ matrix) is just the scalar itself, I can write:

$$\mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y} = \text{tr}\left(\mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y}\right)$$

And since the trace is invariant under cyclic permutations, $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$, and $(K^{-1})^\top = K^{-1}$ because $K$ (and therefore $K^{-1}$) is symmetric, and $(AB)^\top = B^\top A^\top$, I can rewrite this term as:

$$\text{tr}\left(\mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y}\right) = \text{tr}\left(K^{-1}\mathbf{y} \mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} \right) = \text{tr}\left((K^{-1}\mathbf{y}) ( K^{-1} \mathbf{y})^\top \frac{\partial K}{\partial \theta_j} \right)$$
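This cyclic-permutation step is easy to verify numerically. Below is a minimal sketch with a random symmetric positive-definite $K$ and a random symmetric matrix standing in for $\partial K / \partial \theta_j$ (both hypothetical, chosen only to exercise the identity):

```python
# Numerical check of the cyclic-permutation step:
# y^T K^{-1} dK K^{-1} y  ==  tr(alpha alpha^T dK)  with alpha = K^{-1} y.
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)   # random symmetric positive-definite K
B = rng.normal(size=(n, n))
dK = B + B.T                  # symmetric stand-in for dK/dtheta_j
y = rng.normal(size=n)

Kinv = np.linalg.inv(K)
alpha = Kinv @ y

scalar = y @ Kinv @ dK @ Kinv @ y               # the original quadratic form
cycled = np.trace(np.outer(alpha, alpha) @ dK)  # after the cyclic shift
print(abs(scalar - cycled))  # agrees to floating-point precision
```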

If we substitute this into our formula, and take into account that $\text{tr}(A) + \text{tr}(B) = \text{tr}(A + B)$, we can write: $$ \frac{\partial \log p(\mathbf{y} \mid X, \theta)}{\partial \theta_j} = \frac{1}{2} \mathbf{y}^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} \mathbf{y} - \frac{1}{2} \text{tr} \left(K^{-1} \frac{\partial K}{\partial \theta_j} \right) \\ = \frac{1}{2} \left( \text{tr} \left((K^{-1}\mathbf{y}) ( K^{-1} \mathbf{y})^\top \frac{\partial K}{\partial \theta_j} \right) - \text{tr} \left(K^{-1} \frac{\partial K}{\partial \theta_j} \right) \right) \\ = \frac{1}{2} \text{tr} \left( (\alpha \alpha^\top - K^{-1})\frac{\partial K}{\partial \theta_j} \right) $$

With $\alpha = K^{-1}\mathbf{y}$.
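The full equivalence of the two forms can also be checked numerically. This is a minimal sketch using a random symmetric positive-definite $K$ and a random symmetric placeholder for $\partial K / \partial \theta_j$ (hypothetical matrices, not tied to any particular kernel):

```python
# Numerical check that both sides of the derived identity agree:
# 0.5 y^T K^{-1} dK K^{-1} y - 0.5 tr(K^{-1} dK) == 0.5 tr((alpha alpha^T - K^{-1}) dK)
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)   # random symmetric positive-definite K
B = rng.normal(size=(n, n))
dK = B + B.T                  # symmetric placeholder for dK/dtheta_j
y = rng.normal(size=n)

Kinv = np.linalg.inv(K)
alpha = Kinv @ y

lhs = 0.5 * y @ Kinv @ dK @ Kinv @ y - 0.5 * np.trace(Kinv @ dK)
rhs = 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)
print(abs(lhs - rhs))  # agrees to floating-point precision
```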