Relation of trace and matrix derivatives


I've been trying to calculate the gradients of a neural network for backpropagation. Applying the chain rule naively didn't work because the matrix dimensions wouldn't line up. Then I found that the trace function, together with differentials, can be used to obtain the correct derivatives. I tried it and successfully found the correct ones, as below:

$$A:B=\operatorname{tr}(A^TB)$$ $$O = J(W_2^TW_1^TX)$$ $$\eqalign{dO &= \nabla J : d(W_2^TW_1^TX) \cr &= \nabla J : \big(dW_2^T\,W_1^TX + W_2^T(dW_1^T\,X + W_1^T\,dX)\big) \cr &= \nabla J : dW_2^TW_1^TX + \nabla J : W_2^TdW_1^TX + \nabla J : W_2^TW_1^TdX \cr &= \nabla J\,X^TW_1 : dW_2^T + W_2\nabla J\,X^T : dW_1^T + W_1W_2\nabla J : dX}$$ Holding $W_2$ and $X$ constant ($dW_2 = 0$, $dX = 0$) and using $A:B = A^T:B^T$, $$\eqalign{dO &= W_2\nabla J\,X^T : dW_1^T = X\nabla J^TW_2^T : dW_1 \cr \frac{\partial O}{\partial W_1} &= X\nabla J^TW_2^T}$$
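The result can be sanity-checked numerically. The sketch below assumes a hypothetical scalar loss $J(Y) = \tfrac12\|Y\|_F^2$ (so $\nabla J = Y$); all the variable names and shapes are my own choices, not from the derivation above. It compares the closed-form gradient $X\,\nabla J^T W_2^T$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m, b = 4, 3, 2, 5
W1 = rng.standard_normal((n, h))
W2 = rng.standard_normal((h, m))
X  = rng.standard_normal((n, b))

# Hypothetical scalar loss J(Y) = ||Y||_F^2 / 2, so gradJ(Y) = Y.
def J(Y):
    return 0.5 * np.sum(Y**2)

def gradJ(Y):
    return Y

def O(W1):
    return J(W2.T @ W1.T @ X)

# Closed-form gradient from the derivation: dO/dW1 = X gradJ^T W2^T
Y = W2.T @ W1.T @ X
analytic = X @ gradJ(Y).T @ W2.T

# Central finite-difference check, one entry of W1 at a time
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(n):
    for j in range(h):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        numeric[i, j] = (O(Wp) - O(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The same check works for any differentiable $J$ as long as `gradJ` is updated to match.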

I can calculate it, but I am lost looking for an explanation. Why and how is the trace related to matrix derivatives?

Best answer:

A better definition of the Frobenius product is $$G:dX = \sum_{i=1}^m\sum_{j=1}^n G_{ij}\;dX_{ij}$$ The fact that this can also be written in terms of a well-known matrix function is an unfortunate coincidence.
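The equivalence of the two definitions is easy to verify numerically; in this sketch the matrices are arbitrary random examples:

```python
import numpy as np

rng = np.random.default_rng(1)
G  = rng.standard_normal((3, 4))
dX = rng.standard_normal((3, 4))

elementwise = np.sum(G * dX)        # sum_ij G_ij dX_ij
via_trace   = np.trace(G.T @ dX)    # tr(G^T dX)

print(np.isclose(elementwise, via_trace))  # True
```

Note that the trace form computes an entire matrix product only to read off its diagonal, which is another reason to prefer the elementwise definition.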

The intent of demonstrating the equivalent trace formula is to make things seem more familiar, but it usually has the opposite effect and confuses things further.

As an antidote to this confusion, consider the analogous formula for third-order tensors $${\cal G}\,\therefore\,d{\cal X} = \sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^p {\cal G}_{ijk}\;d{\cal X}_{ijk}$$ which cannot be re-written in terms of any of the standard matrix functions, including the trace.
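The elementwise definition carries over to the third-order case directly, while the trace does not. A minimal sketch with arbitrary random tensors, using `einsum` for the triple contraction:

```python
import numpy as np

rng = np.random.default_rng(2)
G  = rng.standard_normal((2, 3, 4))
dX = rng.standard_normal((2, 3, 4))

# Triple contraction: sum_ijk G_ijk dX_ijk.
# There is no trace formulation here, but the elementwise
# definition generalizes with no change.
contraction = np.einsum('ijk,ijk->', G, dX)
print(np.isclose(contraction, np.sum(G * dX)))  # True
```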