Background: I'm trying to implement an Expectation-Maximization (EM) algorithm for Gaussian Processes. In the M-step, I take the derivative of the likelihood with respect to the kernel parameter(s). I'm wondering how to compute that derivative in the following setting:
$$ \sigma \in \mathbb{R}, \qquad K : \mathbb{R} \to \mathbb{R}^{m \times m}, \qquad f: \mathbb{R}^{m \times m} \to \mathbb{R} $$
We can think of $\sigma$ as the kernel parameter (e.g. the width of a Gaussian kernel), $K$ as the map from that parameter to the kernel matrix, and $f$ as the likelihood function.
I'm having difficulty with getting the dimensions to work out when evaluating $$ \frac{\partial f(K(\sigma)) }{\partial \sigma}$$
Can someone help with this? I'm trying to solve this using a "chain rule" approach, but the dimensions don't seem to make sense in the context of the problem. Specifically, I seem to be getting an $m \times m$ matrix instead of a scalar.
Applying the chain rule to the composition $f\circ K$, we have $D(f\circ K)(\sigma)=Df(K(\sigma))\circ DK(\sigma)$.
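Writing this out componentwise, with $x_1,\dots,x_{m^2}$ denoting the entries of the matrix argument of $f$ and $K_i$ the corresponding entry of $K$, gives $\sum^{m^2}_{i=1}\frac{\partial f}{\partial x_i}(K(\sigma))\cdot \frac{dK_i}{d\sigma}(\sigma)$, which is a scalar.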
(Actually, $D(f\circ K)(\sigma):\mathbb R\to \mathbb R$ is a linear transformation $T$ which is identified with $T(1).$)
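As a sanity check, here is a minimal numerical sketch of the chain rule for a concrete case. Everything specific in it is an assumption for illustration, not something from the question: a Gaussian kernel $K_{ij}(\sigma)=\exp\!\big(-(x_i-x_j)^2/(2\sigma^2)\big)$ plus a small fixed noise term on the diagonal, and $f(K) = -\tfrac12 y^\top K^{-1}y - \tfrac12\log\det K$ (a GP log marginal likelihood up to a constant). The function names are hypothetical. The key point is that the entrywise product of $\partial f/\partial K$ and $dK/d\sigma$ is an $m\times m$ matrix, and summing over all $m^2$ entries gives the scalar derivative, equivalently $\operatorname{tr}\!\big((\partial f/\partial K)^\top\, dK/d\sigma\big)$.

```python
import numpy as np

def kernel(x, sigma, noise=0.1):
    # Gaussian kernel matrix plus a fixed diagonal noise term (assumed for stability).
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2)) + noise * np.eye(len(x))

def kernel_dsigma(x, sigma):
    # Entrywise derivative dK_ij/dsigma; the noise term does not depend on sigma.
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2)) * d2 / sigma**3

def f(K, y):
    # Assumed likelihood: -1/2 y^T K^{-1} y - 1/2 log det K.
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet

def grad_f(K, y):
    # m x m matrix of partial derivatives df/dK_ij (K is symmetric here).
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ y
    return 0.5 * np.outer(alpha, alpha) - 0.5 * K_inv

def df_dsigma(x, y, sigma):
    # Chain rule: sum over all m^2 entries of (df/dK_ij) * (dK_ij/dsigma).
    K = kernel(x, sigma)
    return np.sum(grad_f(K, y) * kernel_dsigma(x, sigma))

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
sigma = 0.8

# Central finite difference of sigma -> f(K(sigma)) for comparison.
eps = 1e-6
fd = (f(kernel(x, sigma + eps), y) - f(kernel(x, sigma - eps), y)) / (2 * eps)
chain = df_dsigma(x, y, sigma)
print(chain, fd)
```

The two printed numbers should agree to several digits. If you stop at `grad_f(K, y) * kernel_dsigma(x, sigma)` without the `np.sum`, you get the $m\times m$ matrix described in the question, so a missing sum over entries is one plausible source of the dimension mismatch.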