The above equation appears without proof on page 9 (equation 3) of Andrew Ng's notes on Machine Learning
I have tried various approaches to prove this to no avail. From the notes it seems that it should be provable from first principles. I tried using the ordinary chain rule on an element by element basis, but this quickly gets unwieldy. Any hint on the approach to resolve this would help a lot.
The matrix cookbook in Section 2.4 gives rules for differentiation of the trace of a matrix. Use those in conjunction with the fact that trace doesn't change for cyclic permutation of the matrices i.e. tr(ABCD)=tr(DABD) to derive the above expression.