Replace $X$ with $\mbox{diag}(x)$ in trace matrix derivative identity


There is a scalar-by-matrix derivative identity:

$$\frac{\partial}{\partial X}trace\left(AXBX'C\right)=B'X'A'C'+BX'CA$$

How does this change if instead I am trying to find

$$\frac{\partial}{\partial x}trace\left(Adiag(x)Bdiag(x)'C\right)$$

where $x$ is a vector rather than a matrix.

My thinking is that I only need to multiply the original identity by a vector of ones, since that would be the derivative of $diag(x)$. However, I'm not sure how the chain rule interacts with traces.

I ask because I am trying to calculate $$\frac{\partial}{\partial w}trace\left(Ddiag(w)\Omega diag(w)D'\right)$$

where $w \in \mathbb{R}^{N}$, $D \in \mathbb{R}^{M\times N}$, and $\Omega \in \mathbb{R}^{N\times N}$. Also, $\Omega$ can be assumed to be positive definite.

By that reasoning, the result would be

$$\left(2\Omega diag(w)D'D\right)e$$

where $e \in \mathbb{R}^{N}$ is a vector of ones.


There are 2 best solutions below

Best answer

Let $f : \mathbb R^n \to \mathbb R$ be defined by

$$f (\mathrm x) := \mbox{tr} \left( \mathrm A \, \mbox{diag} (\mathrm x) \, \mathrm B \, \mbox{diag} (\mathrm x) \, \mathrm C \right)$$

where $\mathrm A \in \mathbb R^{m \times n}$, $\mathrm B \in \mathbb R^{n \times n}$ and $\mathrm C \in \mathbb R^{n \times m}$ are given. The directional derivative of $f$ in the direction of $\mathrm v \in \mathbb R^n$ at $\mathrm x \in \mathbb R^n$ is given by

$$\begin{array}{rl} \displaystyle\lim_{h \to 0} \dfrac{f (\mathrm x + h \,\mathrm v) - f (\mathrm x)}{h} &= \mbox{tr} \left( \mathrm A \, \mbox{diag} (\mathrm v) \, \mathrm B \, \mbox{diag} (\mathrm x) \, \mathrm C \right) + \mbox{tr} \left( \mathrm A \, \mbox{diag} (\mathrm x) \, \mathrm B \, \mbox{diag} (\mathrm v) \, \mathrm C \right)\\ &= \mbox{tr} \left( \mbox{diag} (\mathrm v) \, \mathrm B \, \mbox{diag} (\mathrm x) \, \mathrm C \, \mathrm A \right) + \mbox{tr} \left( \mbox{diag} (\mathrm v) \, \mathrm C \, \mathrm A \, \mbox{diag} (\mathrm x) \, \mathrm B \right)\\ &= \mathrm v^\top \mbox{diag}^{-1} \left( \mathrm B \, \mbox{diag} (\mathrm x) \, \mathrm C \, \mathrm A \right) + \mathrm v^\top \mbox{diag}^{-1} \left( \mathrm C \, \mathrm A \, \mbox{diag} (\mathrm x) \, \mathrm B \right)\end{array}$$

where $\mbox{diag}^{-1} : \mathbb R^{n \times n} \to \mathbb R^n$ is a linear function that takes a square matrix and extracts its main diagonal as a column vector. Thus, the gradient of $f$ is

$$\nabla_{\mathrm x} f(\mathrm x) = \color{blue}{\mbox{diag}^{-1} \left( \mathrm B \, \mbox{diag} (\mathrm x) \, \mathrm C \, \mathrm A \right) + \mbox{diag}^{-1} \left( \mathrm C \, \mathrm A \, \mbox{diag} (\mathrm x) \, \mathrm B \right)}$$
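One way to sanity-check the gradient above is to compare it with central finite differences. The following is only a minimal sketch, assuming NumPy and randomly generated test matrices; the helper names `f`, `grad_f` and `numerical_grad` are illustrative and not part of the answer.

```python
import numpy as np

def f(x, A, B, C):
    """Evaluate tr(A diag(x) B diag(x) C)."""
    X = np.diag(x)
    return np.trace(A @ X @ B @ X @ C)

def grad_f(x, A, B, C):
    """Closed-form gradient: diagonal of B X C A plus diagonal of C A X B."""
    X = np.diag(x)
    CA = C @ A
    # np.diag on a 2-D array extracts the main diagonal (the diag^{-1} map above)
    return np.diag(B @ X @ CA) + np.diag(CA @ X @ B)

def numerical_grad(func, x, h=1e-6):
    """Central-difference approximation of the gradient of func at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

# Random test data (assumed shapes: A is m x n, B is n x n, C is n x m)
rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, m))
x = rng.standard_normal(n)

analytic = grad_f(x, A, B, C)
numeric = numerical_grad(lambda z: f(z, A, B, C), x)
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Because $f$ is quadratic in $\mathrm x$, the central difference has no truncation error, so the printed discrepancy should be at the level of floating-point rounding.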

Second answer

For typing convenience, use a colon as a product notation for the trace, i.e.

$$A:B = {\rm Tr}(AB^T) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij}$$

and assign a name to the function of interest

$$\eqalign{ \phi &= {\rm Tr}\left(AXBX^TC\right) \\ &= CAXB:X \\ &= A^TC^TXB^T:X }$$

Then the gradient that you discovered can be written as the differential relationship

$$d\phi = \big(CAXB + A^TC^TXB^T\big):dX$$

Let's also carefully name the diagonal operations. The diag() function creates a vector from the diagonal of its matrix argument, while the Diag() function does the opposite, creating a diagonal matrix from a vector argument, e.g.

$$X = {\rm Diag}(x) \quad\implies\quad x = {\rm diag}(X)$$

The colon product has a very interesting property with respect to these operators

$$A:{\rm Diag}(x) = {\rm diag}(A):x$$

Using all of the above, we can calculate the gradient of interest as follows

$$\eqalign{ d\phi &= \big(CAXB + A^TC^TXB^T\big):{\rm Diag}(dx) \\ &= {\rm diag}\big(CAXB + A^TC^TXB^T\big):dx \\ \frac{\partial \phi}{\partial x} &= {\rm diag}\big(CAXB + A^TC^TXB^T\big) }$$

Substituting $A=D,\,C=D^T$ and $B=\Omega=\Omega^T$, the gradient can be simplified to

$$\eqalign{ \frac{\partial \phi}{\partial x} &= {\rm diag}\big(D^TDX\Omega + D^TDX\Omega\big) \\ &= {\rm diag}\big(2\,D^TDX\Omega\big) }$$

Since ${\rm diag}(M^T)={\rm diag}(M)$ even when $M^T\ne M$, this gradient can also be written as

$$\frac{\partial \phi}{\partial x} = {\rm diag}\big(2\,\Omega XD^TD\big)$$
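The special case from the question can be checked numerically in the same spirit. The sketch below assumes NumPy and uses randomly generated $D$, $\Omega$ and $w$ (hypothetical test data, with $\Omega$ built to be symmetric positive definite). It compares the diagonal-extraction formula against central differences and, for contrast, also evaluates the vector-of-ones expression proposed in the question. The Hadamard form $2\,(\Omega\circ D^TD)\,w$ is an equivalent rewriting added here, not taken from the answer; it follows from $(\Omega X D^TD)_{ii}=\sum_j \Omega_{ij}\,w_j\,(D^TD)_{ji}$ and the symmetry of $D^TD$.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
D = rng.standard_normal((M, N))
R = rng.standard_normal((N, N))
Omega = R @ R.T + N * np.eye(N)  # symmetric positive definite by construction
w = rng.standard_normal(N)

def phi(w):
    """Evaluate tr(D Diag(w) Omega Diag(w) D^T)."""
    W = np.diag(w)
    return np.trace(D @ W @ Omega @ W @ D.T)

W = np.diag(w)
G = 2 * Omega @ W @ D.T @ D

grad_diag = np.diag(G)                         # diagonal of 2 Omega W D^T D
grad_hadamard = 2 * (Omega * (D.T @ D)) @ w    # equivalent elementwise-product form
grad_ones = G @ np.ones(N)                     # row sums, as conjectured in the question

# Central-difference reference gradient
h = 1e-6
numeric = np.array([
    (phi(w + h * np.eye(N)[i]) - phi(w - h * np.eye(N)[i])) / (2 * h)
    for i in range(N)
])

print("diag formula matches finite differences:", np.allclose(grad_diag, numeric, atol=1e-5))
print("Hadamard form matches diag formula:", np.allclose(grad_diag, grad_hadamard))
print("ones-vector expression matches:", np.allclose(grad_ones, numeric, atol=1e-3))
```

Multiplying by the vector of ones sums each row of $2\,\Omega X D^TD$, whereas the gradient keeps only the diagonal entries, so the two should not be expected to agree for generic data.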