Differentiating the trace of $X^T A X$


I'm solving an optimization problem whose Lagrangian is:

$$\mathcal L(X) = \operatorname{trace}(X^TAX)-\operatorname{trace}(\Lambda(cX^TX-I))$$

where $\Lambda$ is a diagonal matrix with the Lagrange multipliers.

I want to take $\frac{\partial \mathcal L(X)}{\partial X}=0$ and solve, but I don't know how to differentiate the trace. The trace operator being used in this way is quite foreign to me; how does one go about (1) finding the derivative and (2) solving for values of a matrix-valued function?

There are 3 answers below.

Best Answer

You can use this guide - Practical Guide to Matrix Calculus for Deep Learning - and the fact $$\text{trace}(A^TB)=A\cdot B=\sum_{i,j}A_{i,j}B_{i,j}$$ Writing the Lagrangian with this inner product (assuming $c$ is a scalar and $\Lambda$ is constant with respect to $X$), $$\mathcal L(X)=AX\cdot X-c\,\Lambda^T\cdot X^TX+I\cdot\Lambda$$ Now, using the rules [mostly (6), but also (8), (9), (10) and so on] of matrix differential calculus from the guide, the differential is $$d\mathcal L(X)=A\,dX\cdot X+AX\cdot dX-c\,\Lambda^T\cdot(dX^TX+X^TdX)=\\=(A+A^T)X\cdot dX-c\,(X\Lambda+X\Lambda^T)\cdot dX=\\=\left[(A+A^T)X-cX(\Lambda+\Lambda^T)\right]\cdot dX$$ Using rule (17) from the guide, the gradient is $$\frac{\partial \mathcal L(X)}{\partial X}=(A+A^T)X-cX(\Lambda+\Lambda^T)$$ Equating to zero gives $$(A+A^T)X=cX(\Lambda+\Lambda^T)$$ Since $\Lambda$ is diagonal, $\Lambda+\Lambda^T=2\Lambda$, so this is the eigenvalue equation $$\tfrac{1}{2}(A+A^T)X=cX\Lambda$$ i.e. each column $x_i$ of $X$ is an eigenvector of the symmetric part of $A$ with eigenvalue $c\lambda_i$. Note that $X$ appears on both sides, so there is no closed form obtained by simply inverting $(\Lambda+\Lambda^T)$.
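As a quick numerical sanity check of this gradient formula (a sketch assuming NumPy; the variable names `A`, `Lam`, `X`, `c` are mine), one can compare it against central finite differences. Since $\mathcal L$ is quadratic in $X$, central differences should agree essentially to machine precision:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, c = 5, 3, 2.0
A = rng.standard_normal((n, n))        # general (not necessarily symmetric) A
Lam = np.diag(rng.standard_normal(k))  # diagonal matrix of multipliers
X = rng.standard_normal((n, k))

def lagrangian(X):
    return np.trace(X.T @ A @ X) - np.trace(Lam @ (c * X.T @ X - np.eye(k)))

# Claimed gradient: (A + A^T) X - c X (Lam + Lam^T)
grad = (A + A.T) @ X - c * X @ (Lam + Lam.T)

# Central finite differences, entry by entry
eps = 1e-6
num = np.zeros_like(X)
for i in range(n):
    for j in range(k):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (lagrangian(X + E) - lagrangian(X - E)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-5))  # True
```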

Answer

Computing gradients is delightfully easy when we use the equation $$ \tag{1} f(X+ \Delta X) \approx f(X) + \langle \nabla f(X), \Delta X\rangle. $$

Define $f$ by $$ f(X) = \langle X, AX\rangle = \text{trace}(X^T AX). $$ If $\Delta X$ is a small matrix, then \begin{align} f(X+ \Delta X) &= f(X) + \langle X, A \Delta X\rangle + \langle \Delta X, AX\rangle + \underbrace{ \langle \Delta X, A \Delta X\rangle}_{\text{negligible}} \\ &\approx f(X) + \langle X, A \Delta X\rangle + \langle \Delta X, AX\rangle \\ &= f(X) + \langle (A^T + A)X , \Delta X\rangle. \end{align} Comparing this with equation (1), we discover that $$ \nabla f(X) = (A^T + A) X. $$
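Equation (1) can also be verified numerically (a sketch assuming NumPy; the perturbation scale `dX` is my choice): for a small $\Delta X$, the difference between $f(X+\Delta X)$ and the first-order approximation $f(X)+\langle\nabla f(X),\Delta X\rangle$ should be second order in $\Delta X$, hence tiny:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
dX = 1e-6 * rng.standard_normal((n, n))  # small perturbation

f = lambda X: np.trace(X.T @ A @ X)
grad = (A.T + A) @ X                     # claimed gradient

lhs = f(X + dX)
rhs = f(X) + np.sum(grad * dX)           # <grad, dX>, the Frobenius inner product
print(abs(lhs - rhs))                    # second order in dX, i.e. ~1e-12 scale
```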

Answer

Denote the trace/Frobenius product with a colon, i.e. $\;P:Q={\rm Tr}(P^TQ)$
The properties of the trace allow terms in a Frobenius product to be rearranged, e.g. $$\eqalign{ &P^T:Q^T &= P:Q \;= Q:P \\ &P:QR &= PR^T:Q = Q^TP:R \\ }$$ Define the matrix $$B = \tfrac{1}{c} A \quad\implies A = cB$$ Write the Lagrangian in terms of this new variable.
Then calculate its differential and gradients. $$\eqalign{ {\cal L} &= cB:XX^T - c\Lambda:X^TX + I:\Lambda \\ d{\cal L} &= cB:(X\,dX^T+dX\,X^T) - c\Lambda:(X^TdX+dX^TX) + (I-cX^TX):d\Lambda \\ &= c(B+B^T):dX\,X^T - c(\Lambda+\Lambda^T):X^TdX + (I-cX^TX):d\Lambda \\ &= 2c(BX-X\Lambda):dX \quad+\; (I-cX^TX):d\Lambda \\ \frac{\partial{\cal L}}{\partial X} &= 2c(BX-X\Lambda),\quad \frac{\partial{\cal L}}{\partial \Lambda} = (I-cX^TX) \\ }$$ The second gradient simply recovers the constraint, while setting the first gradient to zero results in an eigenvalue equation. $$\eqalign{ BX = X\Lambda \\ }$$ where the columns of $X$ are the eigenvectors of $B$ and the elements on the diagonal of $\Lambda$ are the associated eigenvalues. Since $B$ is symmetric, the $X$ matrix is orthogonal.

Without loss of generality, the above assumes that $(A,B)$ are symmetric matrices. And since $\Lambda$ is diagonal, it is also symmetric.
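The stationarity condition $BX=X\Lambda$ together with the constraint $cX^TX=I$ can be checked concretely (a sketch assuming NumPy; the symmetrization of `A` and the scaling of the eigenvector matrix are mine, following the symmetry assumption above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 5, 3.0
M = rng.standard_normal((n, n))
A = M + M.T                      # symmetric A, as assumed above
B = A / c

# Eigendecomposition of the symmetric matrix B: columns of V are eigenvectors
evals, V = np.linalg.eigh(B)
Lam = np.diag(evals)
X = V / np.sqrt(c)               # rescale so that c X^T X = I

print(np.allclose(B @ X, X @ Lam))          # stationarity: B X = X Lam
print(np.allclose(c * X.T @ X, np.eye(n)))  # constraint: c X^T X = I
```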