Chain rule for [scalar to matrix to scalar]


Background: I'm trying to implement an Expectation Maximization algorithm for Gaussian Processes. In the M-step, I'm taking the derivative of the likelihood with respect to the parameter(s) of the kernels. I'm wondering how to update the parameters in the following context:

$$ \sigma \in \mathbb{R}, \qquad K : \mathbb{R} \to \mathbb{R}^{m \times m}, \qquad f: \mathbb{R}^{m \times m} \to \mathbb{R} $$

We can think of $\sigma$ as the parameter of the kernel (e.g. a Gaussian kernel), $K$ as the kernel matrix, and $f$ as the likelihood function.

I'm having difficulty with getting the dimensions to work out when evaluating $$ \frac{\partial f(K(\sigma)) }{\partial \sigma}$$

Can someone help with this? I'm trying to solve this using a "chain rule" approach, but the dimensions don't seem to make sense in the context of the problem. Specifically, I seem to be getting an $m \times m$ matrix instead of a scalar.


There are 3 best solutions below


Applying the chain rule to the composition $f\circ K$, we have $D(f\circ K)(\sigma)=Df(K(\sigma))\circ DK(\sigma)$.

Writing each component out gives $\sum^{m^2}_{i=1}\frac{\partial f}{\partial x_i}(K(\sigma))\cdot \frac{dK_i}{d\sigma}(\sigma)$, which is a scalar.

(Actually, $D(f\circ K)(\sigma):\mathbb R\to \mathbb R$ is a linear transformation $T$ which is identified with $T(1).$)
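A quick numerical sanity check of this componentwise sum (a sketch: the squared-exponential kernel and the function $f$ below are illustrative placeholder choices, not taken from the question):

```python
import numpy as np

# Placeholder kernel: K(sigma)_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))
x = np.array([0.0, 0.5, 1.3])
D2 = (x[:, None] - x[None, :]) ** 2

def K(sigma):
    return np.exp(-D2 / (2 * sigma**2))

def dK_dsigma(sigma):
    # Componentwise derivative of K with respect to sigma
    return K(sigma) * D2 / sigma**3

# Placeholder scalar function f(M) = sum of all entries, so df/dM_ij = 1
def f(M):
    return M.sum()

df_dK = np.ones_like(D2)

sigma = 0.7
# Chain rule: sum over all m^2 components -> a scalar
deriv = np.sum(df_dK * dK_dsigma(sigma))

# Central finite-difference check of d(f o K)/d(sigma)
h = 1e-6
fd = (f(K(sigma + h)) - f(K(sigma - h))) / (2 * h)
assert np.isclose(deriv, fd)
```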


By identifying the space of $m$-square real matrices $\operatorname{Mat}_m(\mathbb R)$ with the Euclidean space $\mathbb R^{m^2}$, meaning we "straighten out" the matrix into a column vector, we obtain the following chain of functions: $$\sigma \in \mathbb R \xrightarrow{\quad K\quad } \mathbb R^{m^2} \xrightarrow{\quad f \quad} \mathbb R. $$ The chain rule applies as in standard multivariable calculus: $$\frac{d(f \circ K)}{d\sigma}(\sigma) = \sum_{j=1}^{m^2} \frac{\partial f}{\partial y^j}(K(\sigma)) \frac{d K^j}{d\sigma}(\sigma); $$ as you can see, this quantity is a scalar, as we would like to see since $(f \circ K) : \mathbb R \to \mathbb R$.
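The same computation after "straightening out" the matrix into a vector in $\mathbb R^{m^2}$, sketched numerically (the kernel and the choice $f(y) = \|y\|^2$ are illustrative assumptions):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 4)
D2 = (x[:, None] - x[None, :]) ** 2

def K_vec(sigma):
    # Kernel matrix flattened into a vector in R^{m^2}
    return np.exp(-D2 / (2 * sigma**2)).ravel()

def f(y):
    # Illustrative f on R^{m^2}: squared Euclidean norm
    return np.dot(y, y)

sigma = 0.5
y = K_vec(sigma)
grad_f = 2 * y                                          # partial f / partial y^j
dK = (np.exp(-D2 / (2 * sigma**2)) * D2 / sigma**3).ravel()  # dK^j / d sigma
deriv = grad_f @ dK                                     # the sum over j = 1..m^2: a scalar

# Central finite-difference check
h = 1e-6
fd = (f(K_vec(sigma + h)) - f(K_vec(sigma - h))) / (2 * h)
assert np.isclose(deriv, fd)
```

The dot product of the two length-$m^2$ vectors is exactly the sum $\sum_j \frac{\partial f}{\partial y^j}\frac{dK^j}{d\sigma}$ from the chain rule.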


Since you don't explicitly show either of the functions, I'll assume that you don't need help calculating their matrix-valued gradients $$\frac{\partial K}{\partial \sigma} \quad{\rm or}\quad \frac{\partial f}{\partial K}$$ Write the differential of $f(K)$, do a change of variables, then recover the derivative wrt $\sigma$: $$\begin{aligned} df &= \left(\frac{\partial f}{\partial K}\right):\color{red}{dK} \\ &= \left(\frac{\partial f}{\partial K}\right):\color{red}{\left(\frac{\partial K}{\partial \sigma}\right)d\sigma} \\ \frac{df}{d\sigma} &= \left(\frac{\partial f}{\partial K}\right):\left(\frac{\partial K}{\partial \sigma}\right) \\ \end{aligned}$$ where a colon denotes the matrix inner product, i.e. $$\begin{aligned} A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \operatorname{Tr}\left(A^TB\right) \\ A:A &= \|A\|^2_F \qquad \{ {\rm Frobenius\;norm} \} \\ \end{aligned}$$
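For the Gaussian-process setting in the question, the colon product gives the M-step derivative directly. A sketch, assuming the standard GP log marginal likelihood $f(K) = -\tfrac12 y^TK^{-1}y - \tfrac12\log\det K$ (constant dropped) with a squared-exponential kernel; both choices are illustrative, since the question doesn't specify $f$ or $K$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)
y = rng.standard_normal(5)
D2 = (x[:, None] - x[None, :]) ** 2

def K(sigma):
    # Squared-exponential kernel plus a small jitter for numerical stability
    return np.exp(-D2 / (2 * sigma**2)) + 1e-6 * np.eye(len(x))

def f(Kmat):
    # GP log marginal likelihood, up to an additive constant
    _, logdet = np.linalg.slogdet(Kmat)
    return -0.5 * y @ np.linalg.solve(Kmat, y) - 0.5 * logdet

sigma = 0.3
Kinv = np.linalg.inv(K(sigma))
alpha = Kinv @ y
df_dK = 0.5 * (np.outer(alpha, alpha) - Kinv)        # standard result: df/dK
dK_ds = np.exp(-D2 / (2 * sigma**2)) * D2 / sigma**3  # dK/dsigma (jitter is constant)

# df/dsigma = (df/dK) : (dK/dsigma) = Tr((df/dK)^T dK/dsigma)
deriv = np.sum(df_dK * dK_ds)

# Central finite-difference check
h = 1e-6
fd = (f(K(sigma + h)) - f(K(sigma - h))) / (2 * h)
assert np.isclose(deriv, fd, rtol=1e-4)
```

Note that the elementwise product-and-sum is exactly $\operatorname{Tr}\left(A^TB\right)$ without forming the matrix product.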