How to differentiate the trace of a matrix times its diagonal


Let $\mathbf{\Theta}\in\mathbb{R}^{p\times p}$ be a matrix and denote by $\mbox{diag}(\mathbf{\Theta})\in\mathbb{R}^{p\times p}$ the matrix that has the same diagonal as $\mathbf{\Theta}$ and zeros in every off-diagonal position. I am trying to calculate

$$\frac{\partial \|\mathbf{X}\,[\mathbf{I}-\,(\mathbf{\Theta}-\mbox{diag}(\mathbf{\Theta}))]\,\|_{F}^{2} }{\partial \mathbf{\Theta}}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm, $\mathbf{I}$ the identity matrix and $\mathbf{X} \in \mathbb{R}^{n \times p}$.

The squared Frobenius norm is equal to \begin{align*} &tr(\mathbf{X}^{\intercal}\mathbf{X})+tr(\mathbf{\Theta}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})+tr(diag(\mathbf{\Theta})\mathbf{X}^{\intercal}\mathbf{X}diag(\mathbf{\Theta}))\\ &-2tr(\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})+2tr(\mathbf{X}^{\intercal}\mathbf{X}diag(\mathbf{\Theta}))-2tr(diag(\mathbf{\Theta})\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta}) \end{align*}

I have also worked out the derivatives to be \begin{align*} &\frac{\partial tr(\mathbf{\Theta}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})}{\partial\mathbf{\Theta}}=2\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta},\quad \frac{\partial tr(diag(\mathbf{\Theta})\mathbf{X}^{\intercal}\mathbf{X}diag(\mathbf{\Theta}))}{\partial\mathbf{\Theta}}=2diag(\mathbf{X}^{\intercal}\mathbf{X})diag(\mathbf{\Theta}),\\ &\frac{\partial tr(\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})}{\partial\mathbf{\Theta}}=\mathbf{X}^{\intercal}\mathbf{X},\quad \frac{\partial tr(\mathbf{X}^{\intercal}\mathbf{X}diag(\mathbf{\Theta}))}{\partial \mathbf{\Theta}}=diag(\mathbf{X}^{\intercal}\mathbf{X}),\\ &\frac{\partial tr(diag(\mathbf{\Theta})\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})}{\partial\mathbf{\Theta}}=(\mathbf{X}^{\intercal}\mathbf{X})diag(\mathbf{\Theta})+diag(\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta}). \end{align*}

But when I substitute these into the expansion, I get \begin{align*} \frac{\partial \|\mathbf{X}\,[\mathbf{I}-\,(\mathbf{\Theta}-diag(\mathbf{\Theta}))]\,\|_{F}^{2} }{\partial \mathbf{\Theta}}=2\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta}-2diag(\mathbf{X}^{\intercal}\mathbf{X}\mathbf{\Theta})+2diag(\mathbf{X}^{\intercal}\mathbf{X})-2\mathbf{X}^{\intercal}\mathbf{X}, \end{align*} which I think is wrong because the right-hand side includes components from the diagonal of $\mathbf{\Theta}$ while the left-hand side does not.

As I am not very good with matrix calculus, I would appreciate any intuition. Thank you.

Best answer

Use the identity matrix $I$ and the all-ones matrix $J$ to define the off-diagonal mask $$F = J-I$$ so that $F\odot\Theta = \Theta - {\rm diag}(\Theta)$, where $\odot$ is the elementwise (Hadamard) product.

For typing convenience, define the matrices $$\eqalign{ A &= \Theta \\B &= X(F\odot A)-X \\ }$$ and use a colon to denote the trace/Frobenius product, i.e. $$M:N = {\rm Tr}(M^TN) = {\rm Tr}(MN^T)$$ Note that $B$ is the negative of the matrix inside the norm in the question, which leaves the squared norm unchanged.

Write the cost function using the new notation and calculate its gradient $$\eqalign{ \psi &= B:B \\ d\psi &= 2B:dB \\ &= 2B:X(F\odot dA) \\ &= 2\Big((X^TB)\odot F\Big):dA \\ \frac{\partial\psi}{\partial A} &= 2(X^TB)\odot F \\ }$$
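
To connect this with the expansion in the question, the result can be written out in the original notation. As a sketch, introduce the shorthand $G = X^TX$ and $D = {\rm diag}(\Theta)$ (these two symbols are assumptions introduced only for this expansion), and note that $M\odot F = M - {\rm diag}(M)$ for any square $M$, while ${\rm diag}(GD) = {\rm diag}(G)\,D$ because $D$ is diagonal. Then

$$\eqalign{
X^TB &= G(\Theta - D) - G \\
\frac{\partial\psi}{\partial\Theta} &= 2(X^TB)\odot F \\
&= 2G\Theta - 2GD - 2G - 2\,{\rm diag}(G\Theta) + 2\,{\rm diag}(G)\,D + 2\,{\rm diag}(G) \\
}$$

Compared with the expression obtained in the question, this contains the two additional terms $-2GD + 2\,{\rm diag}(G)\,D$, which come from the individual derivatives $2\,{\rm diag}(G)\,{\rm diag}(\Theta)$ and $G\,{\rm diag}(\Theta)$ listed there. With them included, entry $(i,j)$ of $G\Theta - GD$ is $\sum_{k\ne j}G_{ik}\Theta_{kj}$ and entry $(i,i)$ of ${\rm diag}(G\Theta) - {\rm diag}(G)\,D$ is $\sum_{k\ne i}G_{ik}\Theta_{ki}$, so the gradient does not involve the diagonal of $\Theta$, as one would expect.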


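Some of the steps above used properties of the Frobenius product, stated here for generic matrices $A,B,C$ of compatible sizes (not the specific $A$ and $B$ defined earlier). It can be rearranged in the following ways, which follow from the cyclic property of the trace, $$A:BC = B^TA:C = AC^T:B$$ it is related to the Frobenius norm by $$\|B\|^2_F = B:B$$ and it commutes with the Hadamard product $$A:(B\odot C) = (A\odot B):C$$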
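
A quick numerical sanity check can help when a matrix-calculus result is in doubt. The following is a minimal sketch in Python with NumPy (not part of the derivation itself) that compares the closed-form gradient $2(X^TB)\odot F$ against central finite differences of the cost. The function names `cost`, `grad`, and `numeric_grad`, the sizes `n` and `p`, and the random test data are all assumptions chosen for illustration.

```python
import numpy as np

def cost(X, Theta):
    """Squared Frobenius norm of X [I - (Theta - diag(Theta))]."""
    p = Theta.shape[0]
    off_diag = Theta - np.diag(np.diag(Theta))
    return np.sum((X @ (np.eye(p) - off_diag)) ** 2)

def grad(X, Theta):
    """Closed form 2 (X^T B) ⊙ F, with B = X (F ⊙ Theta) - X."""
    p = Theta.shape[0]
    F = np.ones((p, p)) - np.eye(p)   # off-diagonal mask J - I
    B = X @ (F * Theta) - X           # '*' is the Hadamard product
    return 2.0 * (X.T @ B) * F

def numeric_grad(X, Theta, h=1e-6):
    """Central finite differences, perturbing one entry at a time."""
    G = np.zeros_like(Theta)
    for i in range(Theta.shape[0]):
        for j in range(Theta.shape[1]):
            E = np.zeros_like(Theta)
            E[i, j] = h
            G[i, j] = (cost(X, Theta + E) - cost(X, Theta - E)) / (2 * h)
    return G

# Arbitrary random test data (illustrative sizes only)
rng = np.random.default_rng(0)
n, p = 7, 4
X = rng.standard_normal((n, p))
Theta = rng.standard_normal((p, p))

analytic = grad(X, Theta)
numeric = numeric_grad(X, Theta)
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
print("diagonal of gradient:", np.diag(analytic))
```

Because the cost is quadratic in $\Theta$, the central difference should agree with the closed form up to floating-point rounding. The diagonal of the gradient is exactly zero, since the mask $F$ zeroes it out, which reflects the fact that the cost does not depend on the diagonal of $\Theta$.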