How do I take the derivative of $\text{trace}(XX^TY^TYXX^T)$ with respect to the matrix $X$?


Given matrices $Y$ and $X$, I am trying to compute the derivative of the function $f(X,Y) = \text{trace}(XX^TY^TYXX^T)$ with respect to $X\in\mathbb{R}^{d\times k}$. I am aware of the matrix cookbook, but I cannot find a way to apply any of the given formulas. Should I compute the derivative as in here and use the direction $E=I$, where $I$ is the identity matrix? Is there an alternative way with fewer computations?

There are 5 answers below.

Best answer:

I am adding this answer because it may give some additional insight into how such derivatives can be computed. We start with some prerequisites.

Consider the mapping $$f : \mathbb{R}^{d\times k} \times \mathbb{R}^{l \times d} \to \mathbb{R}, \qquad (X,Y) \mapsto \operatorname{tr}(XX^TY^TY XX^T).$$

First, we note that by the cyclic property of the trace we can equivalently write this as $$f(X,Y) = \operatorname{tr}(XX^TXX^TY^TY) = \langle XX^TXX^T,Y^TY\rangle,$$ where $$\langle A, B\rangle = \operatorname{tr}(AB^T)$$ is a scalar product on the space of $d\times d$ matrices (the transpose in the last step is harmless since $Y^TY$ is symmetric).
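These identities are easy to sanity-check numerically. A minimal sketch in numpy, where the sizes $d=4$, $k=3$, $l=5$ and the random test matrices are hypothetical choices just for the check:

```python
import numpy as np

# Hypothetical sizes: X is d x k, Y is l x d.
rng = np.random.default_rng(0)
d, k, l = 4, 3, 5
X = rng.standard_normal((d, k))
Y = rng.standard_normal((l, d))

f1 = np.trace(X @ X.T @ Y.T @ Y @ X @ X.T)      # original form
f2 = np.trace(X @ X.T @ X @ X.T @ Y.T @ Y)      # after cycling the trace
f3 = np.sum((X @ X.T @ X @ X.T) * (Y.T @ Y))    # <A,B> = tr(AB^T); Y^T Y is symmetric
ok = np.isclose(f1, f2) and np.isclose(f1, f3)
```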

Now, what is a derivative? If we adopt the notion of the Fréchet derivative, then, loosely speaking, a derivative is the linear mapping that best approximates our function at a given point. That is, the derivative at $X$ is a linear mapping $$D_f(X): \mathbb{R}^{d\times k} \to \mathbb{R}$$ satisfying $$f(X+h) - f(X) - D_f(X)(h) = o(\|h\|).$$

But now we can ask: what does a linear mapping from the space of $d\times k$ matrices to the reals actually look like? By the Riesz representation theorem, we know that such a mapping takes the form

$$D_f(X)(h) = \langle h, M(X)\rangle,$$ where $M(X)$ is a $d \times k $ matrix. Thus, we can identify the derivative with this matrix.

With this definition, two things are relatively easy to see. For a linear mapping, the derivative is the mapping itself. For a multilinear mapping $g(x_1,x_2,\dots,x_n)$, the derivative is obtained by applying the mapping to one perturbed component at a time while leaving the others fixed, i.e. $$ D_g(x_1,\dots,x_n)(h_1,\dots,h_n) = g(h_1,x_2,\dots,x_n) + g(x_1,h_2,x_3,\dots,x_n) + \dots + g(x_1,\dots,x_{n-1},h_n).$$ This in particular applies to the scalar product.

Now we are ready to compute the derivative of our function $f$. The first thing we note is that, by the bilinearity of the scalar product, the fact that $Y^TY$ does not depend on $X$, and the chain rule, we have

$$D_f(X)(h) = \langle D_g(X)(h),Y^TY\rangle ,$$ where $g(X) = XX^TXX^T$. Now $g$ is the restriction of a multilinear mapping to the diagonal (each of its four factors is $X$ or $X^T$), so differentiating one factor at a time we obtain

$$D_g(X)(h) = hX^TXX^T + (hX^TXX^T)^T + Xh^TXX^T + (Xh^TXX^T)^T.$$
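The formula for $D_g(X)(h)$ can be checked against a central finite difference of $g$; a numpy sketch, with hypothetical sizes and a random direction $h$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 3
X = rng.standard_normal((d, k))
H = rng.standard_normal((d, k))   # the direction h

def g(X):
    return X @ X.T @ X @ X.T

def Dg(X, h):
    # h X^T X X^T + (h X^T X X^T)^T + X h^T X X^T + (X h^T X X^T)^T
    T1 = h @ X.T @ X @ X.T
    T2 = X @ h.T @ X @ X.T
    return T1 + T1.T + T2 + T2.T

t = 1e-6
fd = (g(X + t * H) - g(X - t * H)) / (2 * t)   # central difference
ok = np.allclose(fd, Dg(X, H), atol=1e-5)
```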

In principle we would now be done calculating the derivative, in the sense that we can describe it as the linear mapping

$$ h \mapsto \langle D_g(X)(h),Y^TY \rangle.$$

However, one point still bothers us: the scalar product in the expression above lives on the space of $d \times d$ matrices, not on the space of $d \times k$ matrices as we intended. Also, the result is not yet in the form $\langle h, M(X) \rangle$. We can, however, use the fact that a linear mapping has an adjoint operator:

$$ \langle D_g(X)(h),Y^TY \rangle = \langle h, D_g(X)^*(Y^TY) \rangle,$$ where $D_g(X)^*$ is defined precisely so that this equation holds, and on the right-hand side the scalar product is now taken on $\mathbb{R}^{d\times k}$.

With some algebra (using that $A = Y^TY$ is symmetric) we find that the adjoint of the mapping $D_g(X)$ is given by $$D_g(X)^*(A) = 2\left(XX^TAX + AXX^TX\right).$$
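As a sanity check, the defining property of the adjoint, $\langle D_g(X)(h), A\rangle = \langle h, D_g(X)^*(A)\rangle$, can be verified numerically for a random symmetric test matrix $A$ (standing in for $Y^TY$); sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 3
X = rng.standard_normal((d, k))
H = rng.standard_normal((d, k))
M = rng.standard_normal((d, d))
A = M + M.T                       # Y^T Y is symmetric, so test with a symmetric A

def Dg(X, h):
    T1 = h @ X.T @ X @ X.T
    T2 = X @ h.T @ X @ X.T
    return T1 + T1.T + T2 + T2.T

def Dg_adj(X, A):
    return 2 * (X @ X.T @ A @ X + A @ X @ X.T @ X)

lhs = np.sum(Dg(X, H) * A)        # <Dg(X)(h), A>  on d x d matrices
rhs = np.sum(H * Dg_adj(X, A))    # <h, Dg(X)^*(A)> on d x k matrices
ok = np.isclose(lhs, rhs)
```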

Thus, combining all our considerations, we find that the derivative of $f$ is identified with the matrix

$$ D_f(X) = 2(XX^T Y^TY X + Y^TY XX^T X).$$
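This final gradient can be validated entrywise against central finite differences of $f$; a numpy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, l = 4, 3, 5
X = rng.standard_normal((d, k))
Y = rng.standard_normal((l, d))

def f(X):
    return np.trace(X @ X.T @ Y.T @ Y @ X @ X.T)

# the closed-form gradient derived above
grad = 2 * (X @ X.T @ Y.T @ Y @ X + Y.T @ Y @ X @ X.T @ X)

# entrywise central finite differences
num = np.zeros_like(X)
t = 1e-6
for i in range(d):
    for j in range(k):
        E = np.zeros_like(X)
        E[i, j] = t
        num[i, j] = (f(X + E) - f(X - E)) / (2 * t)

ok = np.allclose(num, grad, atol=1e-4)
```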

Answer:

Chain rule, using the convention $AXB' = (A\otimes B)\cdot X$ and the rule $\frac{{\rm d\,} }{{\rm d} X} \|\mathbf P X\|^2 = 2\mathbf P' \mathbf P X$ for a 4th-order tensor $\mathbf P$.

$$\begin{aligned} \frac{{\rm d} \|YXX'\|^2}{{\rm d\,} X} &=\frac{\partial \|YXX'\|^2}{\partial (X, X')} \frac{\partial (X, X')}{\partial X} \\&= \begin{bmatrix} \frac{\partial \|YZX'\|^2}{\partial Z=X} ,& \frac{\partial \|YXZ\|^2}{\partial Z=X'} \end{bmatrix} \cdot \begin{bmatrix}\frac{\partial X}{\partial X} \\ \frac{\partial X'}{\partial X} \end{bmatrix} \\&= \begin{bmatrix} \frac{\partial \| (Y\otimes X)\cdot Z\|^2}{\partial Z=X} ,& \frac{\partial \|(YX\otimes \mathbb I)\cdot Z\|^2}{\partial Z=X'} \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \\&= 2\begin{bmatrix} (Y\otimes X)'\cdot(Y\otimes X)\cdot X ,& (YX\otimes \mathbb I)'\cdot(YX\otimes \mathbb I)\cdot X' \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \\&= 2\begin{bmatrix} (Y'Y\otimes X'X)\cdot X ,& (X'Y'Y X\otimes \mathbb I)\cdot X' \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \\&= 2\begin{bmatrix} Y'Y X X'X ,& X'Y'Y X X' \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \\&= 2(Y'Y X X'X + XX'Y'YX) \end{aligned}$$

which coincides with the result obtained from http://www.matrixcalculus.org.

Answer:

$ \def\l{\left} \def\r{\right} \def\n{\nabla} \def\o{{\tt1}} \def\p{\partial} \def\lr#1{\l(#1\r)} \def\t#1{\operatorname{Tr}\lr{#1}} \def\grad#1#2{\frac{\p #1}{\p #2}} $Let's use a colon to denote the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \t{AB^T} \\ A:A &= \big\|A\big\|^2_F\\ }$$ And for typing convenience, let's define the symmetric matrix variables $${A=XX^T,\qquad B=Y^TY }$$ Write the function using the above notation.
Then calculate its differential and gradient. $$\eqalign{ f &= B:A^2 \\ df &= B:(A\,dA+dA\,A) \\ &= (A^TB+BA^T):dA \\ &= (AB+BA):(dX\,X^T+X\,dX^T) \\ &= (AB+BA+B^TA^T+A^TB^T):(dX\,X^T) \\ &= 2(AB+BA)X:dX \\ \grad{f}{X} &= 2(AB+BA)X \\\\ }$$
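With $A = XX^T$ and $B = Y^TY$, the gradient $2(AB+BA)X$ expands to the same matrix as the accepted answer's $2(XX^TY^TYX + Y^TYXX^TX)$; a quick numerical confirmation, with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, l = 4, 3, 5
X = rng.standard_normal((d, k))
Y = rng.standard_normal((l, d))
A = X @ X.T        # symmetric, d x d
B = Y.T @ Y        # symmetric, d x d

g1 = 2 * (A @ B + B @ A) @ X                     # this answer's form
g2 = 2 * (X @ X.T @ B @ X + B @ X @ X.T @ X)     # accepted answer's form
ok = np.allclose(g1, g2)
```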


NB: The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in a variety of ways, e.g. $$\eqalign{ A:BC &= AC^T:B = B^TA:C \\ A:B &= A^T:B^T \\ A:B &= B:A \\ }$$
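These rearrangement rules can be checked numerically for random matrices of compatible (hypothetical) sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((4, 6))
C = rng.standard_normal((6, 5))

def frob(P, Q):
    # P : Q = sum_ij P_ij Q_ij = tr(P Q^T)
    return np.sum(P * Q)

ok = (np.isclose(frob(A, B @ C), frob(A @ C.T, B))      # A:BC = AC^T:B
      and np.isclose(frob(A, B @ C), frob(B.T @ A, C))  # A:BC = B^TA:C
      and np.isclose(frob(A, B @ C), frob(A.T, (B @ C).T)))  # A:B = A^T:B^T
```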

Answer:

I have tried to reproduce an answer that a user posted but later deleted. The deleted answer gave equations $(1)$–$(4)$, where $(1)$–$(3)$ are taken from the matrix cookbook.

$$\partial \text{trace}(X) = \text{trace}(\partial X)\tag{1}$$

$$\partial (XY) = \partial(X) Y + X \partial (Y) \tag{2}$$

$$\partial X^T = (\partial X)^T \tag{3}$$

Then $$\begin{equation}\begin{aligned}\partial \text{trace}(XX^T Y^TYXX^T) &= \text{trace}\left(\partial\left(XX^T Y^TYXX^T\right)\right) \\&= \text{trace}(\partial(X)X^T Y^TYXX^T) + \text{trace}(X(\partial X)^T Y^TYXX^T) \\&+ \text{trace}(XX^T Y^TY(\partial{X})X^T) + \text{trace}(XX^T Y^TYX (\partial X)^T)\end{aligned}\end{equation} \tag{4}$$

Using basic trace properties we have $$\text{trace}(\partial(X)X^T Y^TYXX^T)= \text{trace}(X^T Y^TYXX^T\partial X)$$

$$\text{trace}(X(\partial X)^T Y^TYXX^T) = \text{trace}(X^T XX^TY^TY \partial X)$$

$$\text{trace}(XX^T Y^TY(\partial{X})X^T) = \text{trace}(X^TXX^T Y^TY \partial{X})$$

$$\text{trace}(XX^T Y^TYX (\partial X)^T) = \text{trace}(X^TY^TYX X^T \partial X) $$

Then, taking a closer look at the examples in the matrix cookbook, from $(4)$ we get

$$\frac{\partial}{\partial X}f(X,Y) = \frac{\partial}{\partial X}\text{trace}(XX^T Y^TYXX^T) = 2\left(X^TY^TYX X^T + X^TXX^T Y^TY\right).$$

However, the final result is the transpose of the result the other users provide. Why is that?
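This is a layout-convention issue: identifying the derivative from $\partial f = \text{trace}(M\,\partial X)$ yields a $k\times d$ matrix $M$, while the other answers report the gradient $G$ with $\partial f = \langle G, \partial X\rangle$, which is $G = M^T$. The transpose relation can be checked numerically, again with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, l = 4, 3, 5
X = rng.standard_normal((d, k))
Y = rng.standard_normal((l, d))

# M collects the factors multiplying dX inside the trace (a k x d matrix)
M = 2 * (X.T @ Y.T @ Y @ X @ X.T + X.T @ X @ X.T @ Y.T @ Y)
# gradient in the d x k layout used by the other answers
G = 2 * (X @ X.T @ Y.T @ Y @ X + Y.T @ Y @ X @ X.T @ X)
ok = np.allclose(M.T, G)
```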

Answer:

I felt like adding another answer to highlight how careful one needs to be when applying the chain rule in a straightforward manner. This time, let's do an additional intermediate step:

\begin{align} \frac{{\rm d} \|YXX'\|^2}{{\rm d\,} X} &=\frac{\partial \|YXX'\|^2}{\partial YXX'} \frac{\partial YXX'}{\partial (X, X')} \frac{\partial (X, X')}{\partial X} \tag{1} \\&= \begin{bmatrix}\frac{\partial \|Z\|^2}{\partial Z=YXX'}\end{bmatrix} \cdot \begin{bmatrix} \frac{\partial YZX'}{\partial Z=X} ,& \frac{\partial YXZ}{\partial Z=X'} \end{bmatrix} \cdot \begin{bmatrix}\frac{\partial X}{\partial X} \\ \frac{\partial X'}{\partial X} \end{bmatrix} \tag{2} \\&= \begin{bmatrix}2YXX'\end{bmatrix} \cdot \begin{bmatrix} \frac{\partial (Y\otimes X)\cdot Z}{\partial Z=X} ,& \frac{\partial (YX\otimes \mathbb I)\cdot Z}{\partial Z=X'} \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \tag{3} \\&= \begin{bmatrix}2YXX'\end{bmatrix} \cdot \begin{bmatrix} Y\otimes X ,& YX\otimes \mathbb I \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \tag{4} \\&= 2\begin{bmatrix} Y'YXX'X ,& X'Y'YXX' \end{bmatrix} \cdot \begin{bmatrix}\mathbb I \\ \mathbb T\end{bmatrix} \tag{5} \\&= 2(Y'Y X X'X + XX'Y'YX) \tag{6} \end{align}

Now, you may ask: how does the step from (4) to (5) come about?

Well, it's because we have $Y\cdot (A\otimes B)= A'YB$

But how is that consistent with $(A\otimes B)\cdot X = AXB'$? If we had $Y\cdot (A\otimes B)\cdot X$, then

$$Y\cdot (AXB') = Y\cdot \big((A\otimes B)\cdot X\big) = Y\cdot (A\otimes B)\cdot X = \big(Y\cdot (A\otimes B)\big) \cdot X = (A'YB)\cdot X$$

A contradiction! So what's going on? Is associativity broken? No. The point is that this would only be a contradiction if "$\cdot$" here meant matrix multiplication. But it doesn't! Here, "$\cdot$" must mean a tensor contraction over 2 axes, because $(A\otimes B)\cdot X = (A\otimes B)_{ij, kl}\cdot X_{kl}$, after contracting, has two upper/left indices.

So in order to be consistent, $Y\cdot (A\otimes B)\cdot X$ necessarily must be a scalar quantity, and indeed it translates to

$$Y\cdot (A\otimes B)\cdot X = \underbrace{\langle Y \mid (A\otimes B)\cdot X\rangle}_{=\langle Y\mid AXB'\rangle} = \underbrace{\langle (A'\otimes B') \cdot Y \mid X\rangle}_{=\langle A'YB \mid X\rangle}$$

And the apparent contradiction disappears. Now, one can try to use a colon notation "$:$" as greg does, and I am sure that it is useful in many circumstances, but it only goes so far: what if we need 3rd- or 4th-order tensor contractions?
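The two conventions used above can be checked concretely. With numpy's row-major `ravel`, $(A\otimes B)\cdot X = AXB'$ is realized by `np.kron`, and the adjoint identity $\langle Y \mid (A\otimes B)\cdot X\rangle = \langle A'YB \mid X\rangle$ holds; a sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, p, q = 3, 4, 5, 2
A = rng.standard_normal((m, p))
B = rng.standard_normal((n, q))
Xm = rng.standard_normal((p, q))
Ym = rng.standard_normal((m, n))

# convention (A ⊗ B)·X = A X B', realized via row-major vectorization
lhs = (np.kron(A, B) @ Xm.ravel()).reshape(m, n)
ok1 = np.allclose(lhs, A @ Xm @ B.T)

# adjoint identity <Y, (A ⊗ B)·X> = <(A' ⊗ B')·Y, X> = <A'YB, X>
ok2 = np.isclose(np.sum(Ym * (A @ Xm @ B.T)),
                 np.sum((A.T @ Ym @ B) * Xm))
ok = ok1 and ok2
```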