I am trying to minimize a function $A$ with respect to $W$, so I am seeking its gradient:
$$ A = \ln \det ( WW^T + \sigma^2I)$$
So according to the chain rule I found
$$ \frac{\partial A}{\partial W_{ij}} = \mathrm{tr}\!\left(\left(\frac{\partial g(U)}{\partial U}\right)^T \frac{\partial U}{\partial W_{ij}}\right) $$
Where
$$ U = WW^T + \sigma^2 I $$ $$ g(U) = \ln \det U $$
I also found that
$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}\partial U) $$
Since $ \partial U $ with respect to $U$ should be just a matrix full of ones, call it $S$,
$$ \frac{\partial \ln \det U}{\partial U} = tr(U^{-1}S) $$
And also
$$ \frac{\partial U}{\partial W_{ij}} = \frac{\partial WW^T + \sigma^2 I}{\partial W_{ij}} = \frac{\partial WW^T}{\partial W_{ij}} $$
Which I found is
$$ \frac{\partial WW^T}{\partial W_{ij}} = WJ^{ji} + J^{ij} W^T $$
where $J^{ij}$ denotes the single-entry matrix with a $1$ at position $(i,j)$ and zeros elsewhere.
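As a sanity check (not part of the original post), this single-entry-matrix identity can be compared against a central finite difference for one index pair; the dimensions and index choices below are arbitrary:

```python
import numpy as np

# Check d(W W^T)/dW_ij = W J^{ji} + J^{ij} W^T for one (i, j),
# where J^{ij} has a 1 at (i, j) and zeros elsewhere.
rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.standard_normal((n, m))
i, j, eps = 1, 2, 1e-6

Jij = np.zeros((n, m))
Jij[i, j] = 1.0
analytic = W @ Jij.T + Jij @ W.T  # W J^{ji} + J^{ij} W^T; note J^{ji} = (J^{ij})^T

# Central finite difference of W W^T in the direction of W_ij.
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (Wp @ Wp.T - Wm @ Wm.T) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-8)
```

So this particular ingredient of the derivation does agree with the numerics; the discrepancy must enter elsewhere.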
So putting it all together
$$ \frac{\partial A}{\partial W} = tr(tr(U^{-1}S)^T \cdot (WJ^{ji} + J^{ij} W^T)) $$
(The transpose can be dropped as our function is scalar.)
However, this result does not seem to agree with a simple numerical derivation. Why is this?
Rewrite the function in terms of the trace function and find its differential:
$$\eqalign{ A &= \log(\det(WW^T+\sigma^2 I)) \cr &= {\rm tr}(\log(WW^T+\sigma^2 I)) \cr\cr dA &= (WW^T+\sigma^2 I)^{-T}:d(WW^T) \cr &= (WW^T+\sigma^2 I)^{-T}:2\,{\rm sym}(dW\,W^T) \cr &= 2\,{\rm sym}\big((WW^T+\sigma^2 I)^{-1}\big):dW\,W^T \cr &= 2\,(WW^T+\sigma^2 I)^{-1}W:dW \cr }$$

Since $dA=\frac{\partial A}{\partial W}:dW$, the gradient must be
$$\eqalign{ \frac{\partial A}{\partial W} &= 2\,(WW^T+\sigma^2 I)^{-1}W \cr }$$

The above derivation employs both the Frobenius $(:)$ product and the ${\rm sym}()$ function:
$$\eqalign{ {\rm sym}(M) &= \frac{1}{2}(M+M^T) \cr A:M &= {\rm tr}(A^TM) \cr }$$
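A minimal version of the "simple numerical derivation" the question mentions confirms this gradient; the dimensions and $\sigma$ below are arbitrary choices for the check:

```python
import numpy as np

# Compare the closed-form gradient 2 (W W^T + s^2 I)^{-1} W against
# central finite differences of A(W) = log det(W W^T + s^2 I).
rng = np.random.default_rng(1)
n, m, sigma = 5, 3, 0.7
W = rng.standard_normal((n, m))

def A(W):
    M = W @ W.T + sigma**2 * np.eye(n)
    return np.linalg.slogdet(M)[1]  # log|det M|; M is SPD here, so det M > 0

# Closed-form gradient: solve M X = W rather than forming M^{-1} explicitly.
grad = 2 * np.linalg.solve(W @ W.T + sigma**2 * np.eye(n), W)

# Entry-wise central finite differences.
eps = 1e-6
num = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (A(Wp) - A(Wm)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-6)
```

The two agree to finite-difference accuracy, whereas the question's expression collapses the per-entry traces incorrectly (the "matrix of ones" step discards the structure of $\partial U$), which is why it disagrees with the numerics.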