Converting a matrix differential to a derivative


I would like to write down the update rule for a set of neural-network parameters that minimizes a loss function which I think is general enough to be instructive for others.

Let $\Phi \in \mathbb{R}^{l \times m \times n}$ be an $l \times m \times n$ tensor of learnable parameters and $\mathscr{L}(\Phi)$ be a scalar loss function of those parameters to be minimized:

$$\mathscr{L} = \beta\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{n}|\Phi_{i}^{\top}\Phi_{i} - \mathbb{I}_{\text{n}}|_{jk},$$

where $|\cdot|$ is element-wise absolute value, $\beta$ is some scalar constant, $\Phi_{i}$ is an $l \times n$ matrix, and $\mathbb{I}_{\text{n}}$ is the $n \times n$ identity matrix. I would like to know the derivative of this loss with respect to an $l$-dimensional vector: $\frac{\partial \mathscr{L}}{\partial \Phi_{ab}}$, where $a$ and $b$ index the $m$ and $n$ dimensions of $\Phi$, respectively.
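To make the setup concrete, here is a minimal NumPy sketch of how I evaluate $\mathscr{L}$. The storage layout (slices along the first axis, so that `Phi[i]` is the $l \times n$ matrix $\Phi_i$), the sizes, and the value of $\beta$ are all illustrative choices on my part, not anything fixed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n = 5, 3, 4          # illustrative sizes
beta = 0.1                 # illustrative constant

# store Phi so that Phi[i] is the l x n slice Phi_i (i runs over the m dimension)
Phi = rng.standard_normal((m, l, n))

def loss(Phi, beta):
    """beta times the sum of element-wise |Phi_i^T Phi_i - I_n| over all slices."""
    n = Phi.shape[2]
    return beta * sum(np.abs(Phi[i].T @ Phi[i] - np.eye(n)).sum()
                      for i in range(Phi.shape[0]))

print(loss(Phi, beta))
```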

Following the chain rule described in chapter 18 of the Matrix Differential Calculus book by Magnus and Neudecker, I can use differentials to get most of the way there. Specifically, I can modify example 18.6a to let $F(X) = |X^{\top}X|$ for some $X \in \mathbb{R}^{l \times n}$, where again $|\cdot|$ is element-wise absolute value, not the determinant. Then,

\begin{align} \text{d}F &= \text{d}|X^{\top}X| \\ &= \frac{X^{\top}X}{|X^{\top}X|} \text{d}(X^{\top}X) \\ &= \frac{X^{\top}X}{|X^{\top}X|} (\text{d}X)^{\top}X + \frac{X^{\top}X}{|X^{\top}X|} X^{\top} \text{d}X \\ &= 2 \frac{X^{\top}X}{|X^{\top}X|} X^{\top}\text{d}X \end{align}

The book also provides an identification theorem for connecting differentials to derivatives: $$\text{d} \text{vec}F = A(X) \text{d} \text{vec}X \iff \frac{\partial\text{vec}F(X)}{\partial(\text{vec}X)^{\top}} = A(X),$$ where $\text{vec}$ is the matrix vectorization operator. I believe I can now use the chain rule to get close to my desired derivative if I set $F=|X^{\top}X-\mathbb{I}_{\text{n}}|$ and $X=\Phi_{i}$: \begin{align} \frac{\partial\mathscr{L}}{\partial(\text{vec}\Phi_{i})^{\top}} &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} \frac{\partial\text{vec}F}{\partial(\text{vec}\Phi_{i})^{\top}} \\ &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} 2 \frac{\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}}{|\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}|} \Phi_{i}^{\top} \end{align}

I do not know how to get from this point to a partial derivative with respect to a single vector, $\Phi_{ab}$. I would guess that almost all of the entries from the sums in $\mathscr{L}$ will be zero for $\frac{\partial \mathscr{L}}{\partial \Phi_{ab}}$. I think I can use this to my advantage, which I think would mean multiplying the above derivative by $\delta_{ia}\delta_{jb}\delta_{kb}$, but this is where I am less sure.

I also used this blog post as a resource. My question is very similar to this one, and also related to this one, this one, and this one, although I was not able to get to an answer from those posts.

BEST ANSWER

For ease of typing define the variables $$\eqalign{ P &= \phi,\quad &X=\big(P^TP-I\big) &\implies dX=\big(P^TdP+dP^TP\big) \\ A &= \operatorname{abs}(X),\quad &G = \operatorname{sign}(X) &\implies \;\, A=G\odot X \\ }$$ where $(\odot)$ is the elementwise/Hadamard product and all functions are applied elementwise. Forget about the subscripts, they'll be added later.

Note that $(G,A,X)$ are symmetric matrices.
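A quick numerical illustration of these definitions (a sketch assuming NumPy; the sizes are arbitrary): it checks that $A = G\odot X$ and that $X$, $A$, $G$ are indeed symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((5, 4))

X = P.T @ P - np.eye(4)    # X = P^T P - I
G = np.sign(X)             # elementwise sign
A = np.abs(X)              # elementwise absolute value

print(np.allclose(A, G * X))                        # A = G (hadamard) X
print(all(np.allclose(M, M.T) for M in (X, A, G)))  # X, A, G are symmetric
```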

Write the elementwise $L_1$-norm (aka Manhattan norm) of $X$ and calculate its differential. $$\eqalign{ {\mu} &= {\tt1}:A \\&= {\tt1}:(G\odot X) \\&= G:X \\ d{\mu} &= G:dX \\ &= G:(P^TdP+dP^TP) \\ &= (G+G^T):P^TdP \\ &= 2PG:dP \\ }$$ where $\tt1$ is the all-ones matrix and a colon is shorthand for the trace, i.e. $\;G\!:\!X = \operatorname{Tr}(G^T\!X)$; the last step uses the symmetry of $G$ noted above.
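The colon-product manipulations can be checked numerically. A small sketch assuming NumPy: it verifies that $\mu = {\tt1}:A = G:X$ and that $G:(P^TD + D^TP) = 2PG:D$ for an arbitrary matrix $D$ (these are exact trace identities given symmetric $G$, so no linearization is involved).

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((5, 4))
D = rng.standard_normal((5, 4))            # an arbitrary stand-in for dP

X = P.T @ P - np.eye(4)
G = np.sign(X)

frob = lambda A, B: np.sum(A * B)          # A:B = Tr(A^T B)

print(np.isclose(np.abs(X).sum(), frob(G, X)))                     # mu = 1:A = G:X
print(np.isclose(frob(G, P.T @ D + D.T @ P), frob(2 * P @ G, D)))  # G:dX = 2PG:dP
```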

Append subscripts to the above result, sum, and multiply by $\beta$ to construct the loss function. $$\eqalign{ {\scr L} &= \beta\sum_i \mu_i \\ d{\scr L} &= \beta\sum_i d\mu_i = \beta\sum_i 2P_iG_i : dP_i \\ \frac{\partial\scr L}{\partial P_i} &= 2\beta\,P_iG_i \\ }$$ In terms of the original variables, the gradient is $$\eqalign{ \frac{\partial\scr L}{\partial\phi_i} &= 2\beta\,\phi_i\,\operatorname{sign}(\phi_i^T\phi_i-I) \\ }$$ NB: $\,\operatorname{sign}(z)$ has a discontinuity at $z=0$, so this gradient doesn't exist everywhere.
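As a sanity check, the gradient can be compared against central finite differences of the loss. This is a rough sketch assuming NumPy and the same $(m, l, n)$ storage layout used in the loss snippet above; since $\operatorname{sign}$ is discontinuous where an entry of $\phi_i^T\phi_i - I$ is exactly zero, a generic random $\Phi$ is used.

```python
import numpy as np

rng = np.random.default_rng(2)
l, m, n, beta = 5, 3, 4, 0.1
Phi = rng.standard_normal((m, l, n))

def loss(Phi):
    return beta * sum(np.abs(Phi[i].T @ Phi[i] - np.eye(n)).sum() for i in range(m))

# analytic gradient: dL/dphi_i = 2*beta*phi_i*sign(phi_i^T phi_i - I), stacked over i
grad = np.stack([2 * beta * Phi[i] @ np.sign(Phi[i].T @ Phi[i] - np.eye(n))
                 for i in range(m)])

# central finite differences over every entry of Phi
num = np.zeros_like(Phi)
eps = 1e-6
for idx in np.ndindex(Phi.shape):
    E = np.zeros_like(Phi)
    E[idx] = eps
    num[idx] = (loss(Phi + E) - loss(Phi - E)) / (2 * eps)

print(np.max(np.abs(grad - num)))          # near zero for a generic Phi
```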

Since $\Phi$ is a $3$rd-order tensor, the above gradient is more clearly expressed in index notation. $$\eqalign{ \phi_i &\to \Phi_{mil} \quad \big({\rm matrix\, used\, in\, the\, preceding\, derivation}\big) \\ \frac{\partial\scr L}{\partial\phi_{i}} &\to \frac{\partial\scr L}{\partial\Phi_{mil}} \;=\; 2\beta \sum_j\Phi_{mij}\,\operatorname{sign} \left(\sum_k\Phi_{kij}\Phi_{kil}-\delta_{jl}\right) \;\doteq\; \Gamma_{mil} \\ }$$ Finally, the matrix components of the crazy derivative that was requested can be written as $$\eqalign{ Q_j &= \sum_i\sum_k \Gamma_{jik}\;e_ie_k^T \\ }$$ where $\{e_i\}$ denotes the standard Cartesian basis vectors.
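For completeness, here is a hedged NumPy/einsum sketch of $\Gamma$, again using the $(m, l, n)$ storage convention from the earlier snippets (so the row index over the $l$-dimension, the slice index, and the column index over the $n$-dimension map to axes 1, 0, and 2 respectively); the $Q_j$ matrices are then just slices of that array.

```python
import numpy as np

rng = np.random.default_rng(3)
l, m, n, beta = 5, 3, 4, 0.1
Phi = rng.standard_normal((m, l, n))

# S[i] = Phi_i^T Phi_i - I_n and G[i] = sign(S[i]), each of shape (n, n)
S = np.einsum('ikj,ikp->ijp', Phi, Phi) - np.eye(n)
G = np.sign(S)

# Gamma has shape (m, l, n); Gamma[i] = 2*beta*Phi[i] @ G[i]
Gamma = 2 * beta * np.einsum('ikj,ijp->ikp', Phi, G)

# per-slice matrix formula for comparison
Gamma_ref = np.stack([2 * beta * Phi[i] @ np.sign(Phi[i].T @ Phi[i] - np.eye(n))
                      for i in range(m)])
print(np.allclose(Gamma, Gamma_ref))       # True

# Q_j collects Gamma over the slice and column indices for a fixed row index j
Q = [Gamma[:, j, :] for j in range(l)]     # each Q_j has shape (m, n)
```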