Matrix trace partial derivative


I am trying to understand how a technique for efficient sparse coding works. I am at the dictionary-training stage and have run into the following problem:

$$ \text{minimize } \|X-BS\|^2_F\\ \text{subject to } \sum_{i=1}^n B^2_{i,j}\le c,\text{ for } j=1,\ldots,m $$

with $X\in \mathbb{R}^n, B\in \mathbb{R}^{n\times m}$ and $S\in\mathbb{R}^m$ .

I have managed to extract the Lagrange dual:

$$ D(\vec{λ}) = \operatorname{tr}(X^TX-XS^T(SS^T+Λ)^{-1}(XS^T)^T-cΛ) \tag{1} \label{1} $$

where $Λ=\operatorname{diag}(\vec{λ})$ and $\vec{λ}$ is the vector of Lagrange multipliers. The paper I am trying to understand states that:

$$ \frac{\partial D(\vec{λ})}{\partial λ_i}=\|XS^T(SS^T+Λ)^{-1}e_i\|^2-c $$

where $e_i$ is the $i$-th unit vector. I have trouble understanding how to obtain this result, particularly when differentiating the term $\operatorname{tr}(XS^T(SS^T+Λ)^{-1}(XS^T)^T)$. Using the identity for the derivative of an inverse matrix, I get: $$ \frac{\partial (SS^T+Λ)^{-1}}{\partial λ_i}=-(SS^T+Λ)^{-1}E_i(SS^T+Λ)^{-1} \tag{2} \label{2} $$ where $E_i$ is the identity matrix with all columns zeroed except the $i$-th. Plugging this result into \eqref{1}, I obtain:

$$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} = {} & \operatorname{tr}(XS^T(SS^T + Λ)^{-1} E_i(SS^T+Λ)^{-1}(XS^T)^T)-c =\\ & \operatorname{tr}((SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}E_i)-c \end{split} $$

Since the inverse term is symmetric, I get: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} =& \operatorname{tr}(\|XS^T(SS^T+Λ)^{-1}\|^2E_i)-c=\\ & \|XS^T(SS^T+Λ)^{-1}\|^2-c \end{split} $$

This result differs from the one reported. What is wrong with my proof? Thank you all for your answers.

Edit

I add the link to the paper for anyone interested, which might help with answering my question: https://web.eecs.umich.edu/~honglak/nips06-sparsecoding.pdf
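Identity \eqref{2} itself can be sanity-checked numerically before worrying about the trace. Below is a quick NumPy sketch (the sizes and random matrices are arbitrary stand-ins, not data from the paper): it compares a finite-difference approximation of $\partial(SS^T+Λ)^{-1}/\partial λ_i$ against $-(SS^T+Λ)^{-1}E_i(SS^T+Λ)^{-1}$.

```python
import numpy as np

# Sanity check of identity (2): d(SS^T + Λ)^{-1}/dλ_i = -M E_i M,
# where M = (SS^T + Λ)^{-1}. Sizes and matrices are random stand-ins.
rng = np.random.default_rng(0)
m, i, h = 4, 1, 1e-6
S = rng.standard_normal((m, 6))
lam = rng.uniform(0.5, 2.0, size=m)   # keep SS^T + Λ well conditioned

M = np.linalg.inv(S @ S.T + np.diag(lam))

# finite-difference approximation of the derivative w.r.t. λ_i
lam_h = lam.copy()
lam_h[i] += h
M_h = np.linalg.inv(S @ S.T + np.diag(lam_h))
fd = (M_h - M) / h

# analytic expression (2), with E_i = e_i e_i^T
E_i = np.zeros((m, m))
E_i[i, i] = 1.0
analytic = -M @ E_i @ M

print(np.max(np.abs(fd - analytic)))  # tiny, on the order of h
```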


Accepted answer

I found the problem: $E_i$ can be written as $e_i {e_i}^T$. However, I cannot find anything wrong with my proof above, yet the two results seem to differ. If anyone can point out what is wrong, they will be awarded the best answer and a big silent 'Thanks'.

Edit

Alright, what I was missing was obvious. After plugging in \eqref{2}, I misjudged the dimensions of the resulting product. Breaking the dimensions down: $$ (SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}E_i\\ (m\times m)(m\times n)(n\times m)(m\times m)(m\times m)=m\times m \text{ matrix} $$ The result is an $m\times m$ matrix, not a scalar, so the step that followed was not valid. However, if I cycle the product inside the trace differently, the step becomes possible. Indeed: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} = {} & \operatorname{tr}(XS^T(SS^T + Λ)^{-1} e_ie_i^T(SS^T+Λ)^{-1}(XS^T)^T)-c =\\ & \operatorname{tr}(e_i^T(SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}e_i)-c \end{split} $$ Now, checking the dimensions of the quantity inside the trace shows that it is a scalar, so the following holds: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} =& \operatorname{tr}(e_i^T(SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}e_i)-c =\\ & \operatorname{tr}((XS^T(SS^T+Λ)^{-1}e_i)^T(XS^T(SS^T+Λ)^{-1}e_i))-c =\\ & \operatorname{tr} (\|XS^T(SS^T+Λ)^{-1}e_i\|^2)-c=\|XS^T(SS^T+Λ)^{-1}e_i\|^2-c \end{split} $$
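As a final sanity check, the corrected derivative can be compared against a finite difference of the dual \eqref{1} itself. The NumPy sketch below uses arbitrary sizes, random data, and an arbitrary value of $c$, with $X$ and $S$ taken as matrices of stacked examples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N, c = 5, 4, 7, 0.3            # arbitrary sizes and constraint level
X = rng.standard_normal((n, N))
S = rng.standard_normal((m, N))
lam = rng.uniform(0.5, 2.0, size=m)

def dual(lam):
    """D(λ) from (1): tr(X^T X - X S^T (S S^T + Λ)^{-1} (X S^T)^T - cΛ)."""
    M = np.linalg.inv(S @ S.T + np.diag(lam))
    XS = X @ S.T
    return np.trace(X.T @ X) - np.trace(XS @ M @ XS.T) - c * lam.sum()

# central finite difference of D along the i-th coordinate
i, h = 2, 1e-6
e_i = np.eye(m)[i]
fd = (dual(lam + h * e_i) - dual(lam - h * e_i)) / (2 * h)

# the paper's formula: ||X S^T (S S^T + Λ)^{-1} e_i||^2 - c
M = np.linalg.inv(S @ S.T + np.diag(lam))
analytic = np.linalg.norm(X @ S.T @ M[:, i]) ** 2 - c

print(fd, analytic)  # the two values agree to high precision
```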

Answer

$\def\p#1#2{\frac{\partial #1}{\partial #2}}\def\d{{\rm diag}}\def\D{{\rm Diag}}\def\o{{\large\tt1}}$For typing convenience, define the variables $$\eqalign{ L &= \Lambda = \D(\lambda) \quad&&\big({\rm Lagrange\,coeffs}\big) \\ M &= \left(SS^T+L\right)^{-1} \quad&\implies\quad &dM=-M\,dL\,M \\ A &= XS^TM \\ a_k &= Ae_k \quad&&\big(k^{th}\,{\rm column\,of\,}A\big) \\ \o&= \sum_{k=1}^n\,e_k \quad&&\big({\rm all\,ones\,vector}\big) \\ }$$ and use a colon as a product notation for the trace, i.e. $$\eqalign{ A:B &= {\rm Tr}(AB^T) \;=\; \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} \\ A:A &= \big\|A\big\|^2_F \\ }$$ Note that when $\{A,B\}$ are vectors, this corresponds to an ordinary dot product.

Write the dual function in terms of the above.
Then calculate its differential and gradient $$\eqalign{ {\cal D} &= X:X - (XS^T)^TXS^T:M - cI:L \\ d{\cal D} &= 0 - (XS^T)^TXS^T:dM - cI:dL \\ &= (XS^T)^TXS^T:M\,dL\,M - cI:dL \\ &= \Big(M(XS^T)^TXS^TM - cI\Big):dL \\ &= \Big(A^TA - cI\Big):\D(d\lambda) \\ &= \d\Big(A^TA - cI\Big):d\lambda \\ &= \Big((A\odot A)^T\o - c\o\Big):d\lambda \\ g=\p{{\cal D}}{\lambda} &= (A\odot A)^T\o - c\o \qquad\qquad\big({\rm gradient\,vector}\big) \\ }$$ from which the $k^{th}$ component can be extracted as follows $$\eqalign{ g_k &= e_k^Tg \\ &= (a_k\odot a_k)^T\o - c(e_k^T\o) \\ &= (a_k\odot a_k):\o - c \\ &= a_k:(a_k\odot\o) - c \\ &= a_k:a_k - c \\ &= \|a_k\|^2 - c \\ }$$ which is the desired result.
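The vectorized gradient can be checked against the per-component formula numerically. In this NumPy sketch (random stand-ins for $X$, $S$, $\lambda$, and $c$), `(A * A).sum(axis=0)` implements $(A\odot A)^T\o$, i.e. the squared column norms of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N, c = 5, 4, 7, 0.3            # arbitrary sizes and constraint level
X = rng.standard_normal((n, N))
S = rng.standard_normal((m, N))
lam = rng.uniform(0.5, 2.0, size=m)

M = np.linalg.inv(S @ S.T + np.diag(lam))
A = X @ S.T @ M

# whole gradient at once: g = (A ⊙ A)^T 1 - c 1, i.e. the squared
# column norms of A minus c
g = (A * A).sum(axis=0) - c

# per-component formula g_k = ||a_k||^2 - c, with a_k the k-th column of A
g_cols = np.array([np.linalg.norm(A[:, k]) ** 2 - c for k in range(m)])

print(np.max(np.abs(g - g_cols)))  # exact up to floating point
```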


In the above, $\odot$ denotes the elementwise/Hadamard product, which has the pleasant property of commuting with the colon product $$(A\odot B):C = A:(B\odot C) \;=\; \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} C_{ij}$$ The all-ones matrix acts as the identity for the Hadamard product.
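The commuting property of the Hadamard and colon products can itself be verified numerically (a small sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, C = (rng.standard_normal((3, 4)) for _ in range(3))

def colon(P, Q):
    """Colon (Frobenius) product: P:Q = Tr(P Q^T)."""
    return np.trace(P @ Q.T)

lhs = colon(A * B, C)                 # (A ⊙ B) : C
rhs = colon(A, B * C)                 # A : (B ⊙ C)
triple = (A * B * C).sum()            # Σ_ij A_ij B_ij C_ij

print(lhs, rhs, triple)  # all three coincide
```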