I am trying to understand how a technique of efficient sparse coding works. I am in the stage of the dictionary training, having the following problem:
$$ \text{minimize } \|X-BS\|^2_F\\ \text{subject to } \sum_{i=1}^k B^2_{i,j}\le c,\text{ for } j=1,\ldots,n $$
with $X\in \mathbb{R}^n, B\in \mathbb{R}^{n\times m}$ and $S\in\mathbb{R}^m$ .
I have managed to extract the Lagrange dual:
$$ D(\vec{λ}) = \operatorname{tr}(X^TX-XS^T(SS^T+Λ)^{-1}(XS^T)^T-cΛ) \tag{1} \label{1} $$
where $Λ=\operatorname{diag}(\vec{λ})$ and $\vec{λ}$ is the vector of the Lagrange coefficients. In the paper, which I am trying to understand, it is said that:
$$ \frac{\partial D(\vec{λ})}{\partial λ_i}=\|XS^T(SS^T+Λ)^{-1}e_i\|^2-c $$
where $e_i$ is the $i$-th unit vector. I have a problem to understand how it is possible to get this result, particularly while differentiating the term $\operatorname{tr}(XS^T(SS^T+Λ)^{-1}(XS^T)^T)$. So, by the identity of inverse matrices I get the result: $$ \frac{\partial (SS^T+Λ)^{-1}}{\partial λ_i}=-(SS^T+Λ)^{-1}E_i(SS^T+Λ)^{-1} \tag{2} \label{2} $$ where $E_i$ is the identity matrix with all columns zeroed, apart from the i-th one. Plugging this result to \eqref{1} I do the following:
$$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} = {} & \operatorname{tr}(XS^T(SS^T + Λ)^{-1} E_i(SS^T+Λ)^{-1}(XS^T)^T)-c =\\ & \operatorname{tr}((SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}E_i)-c \end{split} $$
By viewing that the inverse term is symmetric, I get: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} =& \operatorname{tr}(\|XS^T(SS^T+Λ)^{-1}\|^2E_i)-c=\\ & \|XS^T(SS^T+Λ)^{-1}\|^2-c \end{split} $$
This result differs from the one reported. What is wrong with my proof? Thank you all for your answers.
Edit
I add the link to the paper for anyone interested, which might help with answering my question: https://web.eecs.umich.edu/~honglak/nips06-sparsecoding.pdf
I found out the problem. $E_i$ can be written as $e_i {e_i}^T$. However I cannot find anything wrong with my proof above, yet those results seem to be different. If anyone can point out what is wrong, he will be awarded best answer and a big silent 'Thanks'.
Edit
Alright, what I was missing was so obvious. After the plugging in of (2), I misjudged the dimensions of the resulting product. Breaking down dimensions I have: $$ (SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}E_i\\ (m\times m)(m\times n)(n\times m)(m\times m)(m\times m)=m\times m \text{ matrix} $$ The result does not belong to $\mathbb{R}$ , thus the next step was not possible. However, if I rotate differently the quantity inside trace function, such action becomes possible. Indeed: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} = {} & \operatorname{tr}(XS^T(SS^T + Λ)^{-1} e_ie_i^T(SS^T+Λ)^{-1}(XS^T)^T)-c =\\ & \operatorname{tr}(e_i^T(SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}e_i)-c \end{split} $$ Now, by looking at the new terms-limits of the quantity inside trace function, it becomes apparent that it has the right dimension for the following to take place: $$ \begin{split} \frac{\partial D(\vec{λ})}{\partial λ_i} =& \operatorname{tr}(e_i^T(SS^T+Λ)^{-1}(XS^T)^TXS^T(SS^T+Λ)^{-1}e_i)-c =\\ & \operatorname{tr}((XS^T(SS^T+Λ)^{-1}e_i)^T(XS^T(SS^T+Λ)^{-1}e_i))-c =\\ & \operatorname{tr} (\|XS^T(SS^T+Λ)^{-1}e_i\|^2)-c=\|XS^T(SS^T+Λ)^{-1}e_i\|^2-c \end{split} $$