Derivative of the trace of the product of a matrix and its transpose

3.7k Views Asked by At

I googled and found that the derivative of the trace of the product:

$$\frac{d}{dX} \mbox{Trace} (X^TX) = 2X$$

But I can't find:

$$\frac{d}{dX} \mbox{Trace} (XX^T)$$

I don't major in mathematics, so I don't know how to derive this. Could anyone help me out?

2

There are 2 best solutions below

1
On

$tr(X+H)=Tr((X+H)(X+H)^T)=Tr(XX^T+XH^T+HX^T+HH^T)=f(X)+Tr(XH^T+HX^T)+O(H)$ implies that the derivative is $df_X(H)=Tr(XH^T+HX^T)$.

0
On

Given a scalar function $f$ of several variables $x_1,\dots,x_n$, the differential is defined by

$$\mathrm df=\frac{\partial f}{\partial x_1}\mathrm dx_1+\dots+\frac{\partial f}{\partial x_n}\mathrm dx_n=\sum_i\frac{\partial f}{\partial x_i}\mathrm dx_i$$

This is a linear form in disguise, that can also be written

$$\mathrm df(h)=\sum_i a_ih_i$$

With $a_i=\dfrac{\partial f}{\partial x_i}$.

Now, the differential of a scalar function $f$ of the matrix $X$ (with dimensions $n\times p$) is just a linear form, in the variables $x_{ij}$. You can write

$$\mathrm df=\sum_{ij}\frac{\partial f}{\partial x_{ij}}\mathrm dx_{ij}$$

Or

$$\mathrm df(h)=\sum_{ij}\frac{\partial f}{\partial x_{ij}}h_{ij}$$

Since it's a linear form, it's possible to write $\mathrm df$ as a scalar product of two column vectors:

$$\left(\begin{matrix} \dfrac{\partial f}{\partial x_{11}}\\ \vdots\\ \dfrac{\partial f}{\partial x_{1p}}\\ \dfrac{\partial f}{\partial x_{21}}\\ \vdots\\ \dfrac{\partial f}{\partial x_{2p}}\\ \vdots\\ \dfrac{\partial f}{\partial x_{np}}\\ \end{matrix}\right) \;\;\;\mathrm{ and }\;\;\; \left(\begin{matrix} h_{11}\\ \vdots\\ h_{1p}\\ h_{21}\\ \vdots\\ h_{2p}\\ \vdots\\ h_{np} \end{matrix}\right) $$

There is a more compact way.


First, notice that given two matrices $X,Y$ with same dimensions $n\times p$,

$$\mathrm{tr}(X^TY)=\sum_{ij}x_{ij}y_{ij}$$

You can prove that by writing the general term $a_{ij}$ of the product, which has dimensions $p\times p$:

$$a_{ij}=\sum_{k=1}^n x_{ki}y_{kj}$$

Hence

$$a_{ii}=\sum_{k=1}^n x_{ki}y_{ki}=\sum_{k=1}^n x_{ki}y_{ki}$$

$$\sum_{j=1}^p a_{jj}=\sum_{j=1}^p\sum_{i=1}^n x_{ij}y_{ij}$$

That is, $\mathrm{tr}(X^TY)$ is the sum of term-by-term products of all elements of $X$ and $Y$. That's very similar to a scalar product.

Now we can write $\mathrm{d}f$ in the more compact form

$$\mathrm{d}f(H)=\mathrm{tr}(A^TH)$$

With

$$A=\left(\begin{matrix} \dfrac{\partial f}{\partial x_{11}}&\cdots&\dfrac{\partial f}{\partial x_{1p}}\\ \vdots&\ddots&\vdots\\ \dfrac{\partial f}{\partial x_{n1}}&\cdots&\dfrac{\partial f}{\partial x_{np}} \end{matrix}\right)$$

The differential of a scalar function of a matrix can always be written in this compact form, and $A$ is unique. By convention it's this $A$ we will call $\dfrac{\mathrm df}{\mathrm dX}$ (same convention as in the Matrix Cookbook). There is another convention, where the matrix derivative is instead $A^T$.

To find $A$, you can compute the partial derivatives, or use the Taylor formula for a function of several variables, at order $1$:

$$f(X+H)=f(X)+\mathrm df(H)+o(||H||)$$

Note it's a little-$o$, not a big-$O$ ($\mathrm df(H)$ is already a $O(||H||)$, so that would be meaningless with a big-$O$)), and it's the norm of $H$ inside.


Let's apply this to $f(X)=\mathrm{tr}(X^TX)$. Using the partial derivatives, and noticing that $f(X)=\sum_{ij} x_{ij}^2$, you get at once

$$\dfrac{\mathrm df}{\mathrm dX}=2X$$

Since $\mathrm{tr}(X^TX)=\mathrm{tr}(XX^T)$, the derivative is the same (it's the same function):

$$\dfrac{\mathrm d\left(XX^T\right)}{\mathrm dX}=2X$$

Or with Taylor's formula

$$f(X+H)=\mathrm{tr}\left((X+H)^T(X+H)\right)=f(X)+\mathrm{tr}(X^TH)+\mathrm{tr}(XH^T)+o(||H||)$$

But $\mathrm{tr}(XH^T)=\mathrm{tr}(X^TH)$, so

$$f(X+H)=f(X)+\mathrm{tr}(2X^TH)+o(||H||)$$

And you identify the formula $\mathrm{tr}(A^TX)$, which gives you the derivative $A$. Or you can leave this as a differential

$$\mathrm df(H)=\mathrm{tr}(2X^TH)$$


A few other examples

The trace of the square of $X$ is given by:

$$\mathrm{tr}(X^2)=\sum_i x_{ii}^2+2\sum_{i<j}x_{ij}x_{ji}$$

By computing the partial derivatives or by using Taylor's formula, you find

$$\frac{\mathrm d\left(\mathrm{tr}(X^2)\right)}{\mathrm dX}=2X^T$$

The function $f(X)=\mathrm{tr}(A^TX)$ has derivative

$$\frac{\mathrm d\left(\mathrm{tr}(A^TX)\right)}{\mathrm dX}=A$$

By using an expansion of the determinant of $X$, you can prove that

$$\frac{\mathrm d(\det X)}{\mathrm dX}=\mathrm{Com}\,X$$

Where $\mathrm{Com}\, X$ is the comatrix of $X$.

See also the Wikipedia article on matrix calculus.