Let ∇A(X) denote the derivative of X with respect to the matrix A, and let X^T denote the transpose of the matrix X. Then the following two rules hold.
1) ∇A (trace of AB) = B^T
2) ∇A (trace of AB A^T C) = CAB + C^T A B^T
Both rules are mathematically correct, but I don't see how they are consistent with each other.
For instance, from 1), we can say that
∇A (trace of AB A^T C) = ∇A (trace of A (B A^T C) ) = (B A^T C)^T = C^T A B^T
However, rule 2) says the answer is CAB + C^T A B^T, not C^T A B^T.
Is there something wrong with the way I calculated it? I only used rule 1).
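Short answer: Rule 1 assumes that $B$ does not depend on $A$. In your calculation the factor $BA^TC$ does depend on $A$, so you only differentiated one of the two occurrences of $A$. You must apply the first rule to each of the two occurrences of $A$ in the second expression, holding the other occurrence fixed, and add the results (this is the product rule).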
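To make the short answer concrete, here is a sketch of rule 1 applied once per occurrence. The symbol $A_0$ is notation introduced here (it is not in the question) for a copy of $A$ that is treated as a constant while the other occurrence is differentiated. $$\eqalign{ \nabla_A\,{\rm tr}\big(A\,(BA_0^TC)\big) &= (BA_0^TC)^T = C^TA_0B^T \cr \nabla_A\,{\rm tr}\big(A_0BA^TC\big) &= \nabla_A\,{\rm tr}\big(A\,(B^TA_0^TC^T)\big) = (B^TA_0^TC^T)^T = CA_0B \cr }$$ The second line first rewrites the trace using ${\rm tr}(X)={\rm tr}(X^T)$ and ${\rm tr}(XY)={\rm tr}(YX)$ so that $A$ appears untransposed at the front, which is the form rule 1 requires. Adding the two contributions and setting $A_0=A$ gives $C^TAB^T + CAB$, which is rule 2. The calculation in the question is the first line only.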
An approach from first principles is to write the differential form of the second formula $$\eqalign{ d\,{\rm tr}(ABA^TC) &= {\rm tr}(dA\,BA^TC) + {\rm tr}(AB\,dA^T\,C) \cr &= {\rm tr}(dA^T\,C^TAB^T) + {\rm tr}(dA^T\,CAB) \cr }$$ where the second line utilizes the transpositional and cyclic properties of the trace.
From the differential, using the identification $d\,f = {\rm tr}(dA^T\,\nabla_A f)$, the gradient is seen to be $$\eqalign{ \nabla_A\,{\rm tr}(ABA^TC) &= C^TAB^T + CAB \cr }$$
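If you want to convince yourself numerically, the following is a minimal sketch that compares both candidate answers against a central finite-difference gradient. Everything in it is an assumption made for illustration: the helper names (`matmul`, `numerical_gradient`, and so on), the test matrices, and the step size `h` are not from the question, and plain Python lists are used only to keep the example free of third-party packages.

```python
# Finite-difference check of  grad_A tr(A B A^T C) = C^T A B^T + C A B.
# Matrices are plain lists of rows; all names here are illustrative.

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def max_abs_diff(X, Y):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def f(A, B, C):
    """The scalar function tr(A B A^T C)."""
    return trace(matmul(matmul(matmul(A, B), transpose(A)), C))

def numerical_gradient(A, B, C, h=1e-4):
    """Central differences: entry (i, j) approximates d f / d A[i][j]."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += h
            minus[i][j] -= h
            grad[i][j] = (f(plus, B, C) - f(minus, B, C)) / (2 * h)
    return grad

# Arbitrary test data: A is 3x2, B is 2x2, C is 3x3 (B and C not symmetric).
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
B = [[1.0, 2.0], [0.0, 1.0]]
C = [[1.0, 0.0, 2.0], [1.0, 1.0, 0.0], [0.0, 3.0, 1.0]]

partial = matmul(matmul(transpose(C), A), transpose(B))  # C^T A B^T only
full = add(matmul(matmul(C, A), B), partial)             # C A B + C^T A B^T

numeric = numerical_gradient(A, B, C)
err_full = max_abs_diff(numeric, full)
err_partial = max_abs_diff(numeric, partial)
print("max error, C^T A B^T + C A B:", err_full)
print("max error, C^T A B^T alone:  ", err_partial)
```

For these matrices the two-term formula should agree with the finite-difference gradient up to rounding error, while the one-term result from the question should be off by exactly the missing $CAB$ contribution.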