Derivative of trace of matrix expression with respect to a matrix


In exercise 4.2 of Pattern Recognition and Machine Learning, the official solution states that the derivative of $$E_D(\mathbf{\tilde W}) = \frac{1}{2}{\rm Tr}\{(\mathbf{XW}+\mathbf{1}w_0^T - \mathbf{T})^T(\mathbf{XW}+\mathbf{1}w_0^T - \mathbf{T}) \}$$ with respect to $w_0$ (where $w_0$ is a column vector of bias weights and $\mathbf{1}$ is a column vector of $N$ ones) is $$ 2Nw_0+2(\mathbf{XW-T})^T\mathbf{1} $$ but I do not see how they arrived at this result. Can you please explain how to derive it, and point me to a resource that covers this kind of matrix calculus?
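(Not part of the original question: a quick numerical sanity check. Since $\mathrm{Tr}(M^TM)=\sum_{ij}M_{ij}^2$, the trace objective above is just half the sum of squared residuals. A minimal NumPy sketch with made-up dimensions:)

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 4, 3              # made-up sizes: N samples, D features, K targets
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, K))
w0 = rng.normal(size=(K, 1))   # column vector of bias weights
T = rng.normal(size=(N, K))
ones = np.ones((N, 1))         # column vector of N ones

R = X @ W + ones @ w0.T - T          # residual matrix XW + 1 w0^T - T
E_trace = 0.5 * np.trace(R.T @ R)    # the objective as written in the book
E_sumsq = 0.5 * np.sum(R**2)         # half the elementwise sum of squares
assert np.isclose(E_trace, E_sumsq)
```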



Best answer:

For typing convenience, define the auxiliary matrix $A$ as
$$\eqalign{ A &= {\tt1}w_0^T + (XW -T) \\ A^T &= w_0{\tt1}^T + (XW -T)^T }$$ and use a colon as an infix product notation for the trace function, i.e. $$A:B = {\rm Tr}(A^TB) = {\rm Tr}(B^TA) = B:A$$ Rewrite the objective function in a form which makes it easy to calculate the gradient: $$\eqalign{ \phi &= \tfrac 12A:A = \tfrac 12A^T:A^T \\ d\phi &= A^T:dA^T = A^T:dw_0\,{\tt1}^T = A^T{\tt1}:dw_0 \\ \frac{\partial\phi}{\partial w_0} &= A^T{\tt1} = w_0({\tt1}^T{\tt1}) + (XW-T)^T{\tt1} = Nw_0 + (XW-T)^T{\tt1} \\ }$$ where the last step uses ${\tt1}^T{\tt1}=N$. This agrees with the quoted solution except for the factor of ${\tt2}$, which I suspect is a typo.
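(Added for verification, not part of the original answer.) The closed-form gradient $Nw_0 + (XW-T)^T{\tt1}$ can be checked against central finite differences of $\phi$; a short NumPy sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 6, 4, 3              # arbitrary sizes for the check
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, K))
w0 = rng.normal(size=(K,))
T = rng.normal(size=(N, K))
ones = np.ones(N)

def phi(w0):
    """The objective: 0.5 * Tr(A^T A) with A = 1 w0^T + (XW - T)."""
    R = X @ W + np.outer(ones, w0) - T
    return 0.5 * np.trace(R.T @ R)

# closed-form gradient from the derivation: N*w0 + (XW - T)^T 1
grad = N * w0 + (X @ W - T).T @ ones

# central finite differences along each coordinate of w0
eps = 1e-6
fd = np.array([(phi(w0 + eps * np.eye(K)[i]) - phi(w0 - eps * np.eye(K)[i])) / (2 * eps)
               for i in range(K)])
assert np.allclose(grad, fd, atol=1e-5)
```

Note that the check matches the derivation's result without the factor of 2, supporting the suspicion that the factor in the official solution is a typo.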

The standard reference for this subject is Matrix Differential Calculus by Magnus and Neudecker.