How to apply Chain Rule with differentials in Matrix Derivatives?

@Steph kindly answered my other question, but I can't work out the math.

He said that "The correct way to apply chain rule with matrices is to use differentials", and provided the answer for $\partial E \over \partial W_4$.

OK, suppose that $\partial E \over \partial A_5$ is known to be $(A_5-R)$. Then that answer checks out, no problem.

Now if I use the same approach to calculate $\partial E \over \partial W_3$, I get

$dE={\partial E \over \partial A_5}:dA_5$

$dE=W_4^T{\partial E \over \partial A_5}:dA_4$

$dE=A_3^TW_4^T{\partial E \over \partial A_5}:dW_3$

${\partial E \over \partial W_3}=A_3^TW_4^T(A_5-R)$

The "order" is wrong!

To make it right, the $A$ factor has to go at the very front, and each $W$ factor has to be appended at the very end at every step.

Why is that!?

Why does the same operation $(dA_5=dA_4W_4)$ produce factors in different positions?

The only "possible", if somewhat far-fetched, explanation I could find is this: because $A_4$ is on the left of the product, its contribution $(A_4^T)$ always ends up at the front, and because $W_4$ is on the right, its contribution $(W_4^T)$ always ends up at the very end.

Is that the right reason, or am I just overthinking it?

Thank you very much for your help!

There are 2 best solutions below

$ \def\SSS{\sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^p} \def\A{A_{ij}} \def\B{B_{ik}} \def\BT{B_{ki}^T} \def\C{C_{kj}} \def\CT{C_{jk}^T} \def\LR#1{\left(#1\right)} \def\BR#1{\Big(#1\Big)} $To extend my comment above, by expanding the various products $$\eqalign{ A:\LR{BC} &= \SSS \A\BR{\B\C} \\ \LR{AC^T}:B &= \SSS \BR{\A\CT}\B \\ \LR{B^TA}:C &= \SSS \BR{\BT\A}\C \\ }$$ it is obvious that the sums on the RHS are all identical, therefore the Frobenius (aka double-dot) products appearing on the LHS are likewise identical.
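As a quick sanity check, the three products can be compared numerically. The following is a minimal sketch in plain Python; the helper names (`matmul`, `transpose`, `frob`, `rand_matrix`) and the sizes $m=3$, $n=4$, $p=5$ are illustrative choices, not part of the original answer.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*X)]

def frob(X, Y):
    """Frobenius (double-dot) product X:Y = sum_ij X_ij * Y_ij."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (hypothetical helper)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
m, n, p = 3, 4, 5            # arbitrary sizes chosen for the check
A = rand_matrix(m, n, rng)   # indices (i, j)
B = rand_matrix(m, p, rng)   # indices (i, k)
C = rand_matrix(p, n, rng)   # indices (k, j)

first = frob(A, matmul(B, C))                # A : (BC)
second = frob(matmul(A, transpose(C)), B)    # (A C^T) : B
third = frob(matmul(transpose(B), A), C)     # (B^T A) : C
print(first, second, third)  # the three values should agree up to rounding
```

Note that moving $C$ across the colon puts $C^T$ on the right of $A$, while moving $B$ across puts $B^T$ on the left of $A$; the side on which a factor sits is preserved.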

This equivalence could also be arrived at by considering the properties of the trace function when its matrix argument is transposed and/or cyclically permuted.
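Spelled out, the trace argument might look as follows. This is a sketch that assumes the standard definition $X:Y=\operatorname{tr}(X^TY)$ and uses only the cyclic property of the trace and $(XY)^T=Y^TX^T$:

$$\eqalign{
A:(BC) &= \operatorname{tr}(A^TBC) = \operatorname{tr}(CA^TB) = \operatorname{tr}\big((AC^T)^TB\big) = (AC^T):B \\
A:(BC) &= \operatorname{tr}(A^TBC) = \operatorname{tr}\big((B^TA)^TC\big) = (B^TA):C \\
}$$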

Regarding your question, start from the definition $$ dE = \frac{\partial E}{\partial \mathbf{A}_5}: d \mathbf{A}_5 $$ where the colon denotes the Frobenius inner product. Remember that $\mathbf{A}:\mathbf{B}= \mathrm{tr}(\mathbf{A}^T \mathbf{B}) $.

Consider the 'simple' product layer $\mathbf{A}_5=\mathbf{A}_4 \mathbf{W}_4$. Varying $\mathbf{W}_4$ with $\mathbf{A}_4$ held fixed gives $$ dE = \frac{\partial E}{\partial \mathbf{A}_5}: \mathbf{A}_4 (d \mathbf{W}_4) = \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5}: d \mathbf{W}_4 $$ while varying $\mathbf{A}_4$ with $\mathbf{W}_4$ held fixed gives $$ dE = \frac{\partial E}{\partial \mathbf{A}_5}: (d\mathbf{A}_4) \mathbf{W}_4 = \frac{\partial E}{\partial \mathbf{A}_5} \mathbf{W}_4^T: d \mathbf{A}_4 $$

Thus, by identification, $$ \frac{\partial E}{\partial \mathbf{W}_4} = \mathbf{A}_4^T \frac{\partial E}{\partial \mathbf{A}_5} ,\quad \frac{\partial E}{\partial \mathbf{A}_4} = \frac{\partial E}{\partial \mathbf{A}_5} \mathbf{W}_4^T $$ This gives you the backpropagated gradient of the loss function. Note that a factor that multiplies the differential from the left moves across the colon as a transpose on the left, and a factor that multiplies it from the right moves across as a transpose on the right. This is why your derivation is wrong from the second line onward: there $\mathbf{W}_4$ sits to the right of $d\mathbf{A}_4$, so the line should read $dE = \frac{\partial E}{\partial \mathbf{A}_5} \mathbf{W}_4^T : d\mathbf{A}_4$. As shown in Greg's comment, these facts are easily deduced from the trace properties.
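Applying the same rule one layer further down would give $\frac{\partial E}{\partial \mathbf{W}_3} = \mathbf{A}_3^T (\mathbf{A}_5-\mathbf{R}) \mathbf{W}_4^T$. The sketch below checks this against central finite differences in plain Python. It rests on assumptions that are not stated in the question: purely linear layers $\mathbf{A}_4=\mathbf{A}_3\mathbf{W}_3$ and $\mathbf{A}_5=\mathbf{A}_4\mathbf{W}_4$ with no activation, and the loss $E=\tfrac12\|\mathbf{A}_5-\mathbf{R}\|_F^2$, chosen because it yields $\partial E/\partial \mathbf{A}_5 = \mathbf{A}_5-\mathbf{R}$. The helper names and matrix sizes are illustrative.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*X)]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (hypothetical helper)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def loss(A3, W3, W4, R):
    """Assumed loss E = 0.5 * ||A3 W3 W4 - R||_F^2 (linear layers only)."""
    A5 = matmul(matmul(A3, W3), W4)
    return 0.5 * sum((a - r) ** 2 for ra, rr in zip(A5, R) for a, r in zip(ra, rr))

rng = random.Random(1)
A3 = rand_matrix(2, 3, rng)   # arbitrary sizes chosen for the check
W3 = rand_matrix(3, 4, rng)
W4 = rand_matrix(4, 5, rng)
R = rand_matrix(2, 5, rng)

# Forward pass and dE/dA5 = A5 - R
A5 = matmul(matmul(A3, W3), W4)
G = [[a - r for a, r in zip(ra, rr)] for ra, rr in zip(A5, R)]

# Backward pass: W4^T goes on the right, A3^T goes on the left
grad_A4 = matmul(G, transpose(W4))         # dE/dA4 = G W4^T
grad_W3 = matmul(transpose(A3), grad_A4)   # dE/dW3 = A3^T G W4^T

# Central finite differences on each entry of W3
eps = 1e-6
max_err = 0.0
for i in range(len(W3)):
    for j in range(len(W3[0])):
        W3[i][j] += eps
        e_plus = loss(A3, W3, W4, R)
        W3[i][j] -= 2 * eps
        e_minus = loss(A3, W3, W4, R)
        W3[i][j] += eps                    # restore the original entry
        numeric = (e_plus - e_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - grad_W3[i][j]))
print(max_err)  # should be tiny if the analytic ordering is right
```

With these sizes the alternative ordering $\mathbf{A}_3^T \mathbf{W}_4^T (\mathbf{A}_5-\mathbf{R})$ is not even a conformable product ($3\times 2$ times $5\times 4$), which is a quick way to spot that a factor has landed on the wrong side.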