Help with differentiation chain rule with tensors in backpropagation


Say we're given $N$ feature vectors $\mathbf{x}_i \in \mathbb{R}^{D \times 1}$, assembled into a matrix $X \in \mathbb{R}^{D \times N}$. We also have a matrix $W \in \mathbb{R}^{D \times D}$, $W = XX^\top$, and a predictor matrix $Y \in \mathbb{R}^{D \times N}$, $Y=WX$. Say we have a scalar function, e.g. $\ell = \left\lVert Y \right\rVert_F$. We need to compute the gradients of $\ell$ using back-propagation.

We know that $\frac{\partial \ell}{\partial Y}$ is a matrix. $\frac{\partial \ell}{\partial W}$ is also a matrix and $\frac{\partial \ell}{\partial W} = \frac{\partial \ell}{\partial Y} \frac{\partial Y}{\partial W}$ (by the chain rule).

Here's the problem: $\frac{\partial Y}{\partial W}$ is a 4-tensor and multiplying a matrix by a 4-tensor gives a 4-tensor (albeit, potentially of a different shape), NOT a matrix (as $\frac{\partial \ell}{\partial W}$ should be)!

Obviously, I'm doing something wrong. The question is - what?

Thx


BEST ANSWER

The kind of "multiplication" used here is the double contraction $\frac{\partial\ell}{\partial W_{ab}}=\sum_{cd}\frac{\partial\ell}{\partial Y_{cd}}\frac{\partial Y_{cd}}{\partial W_{ab}}$, where the $\sum_{cd}$ is usually left implicit. Summing over both indices $c,d$ collapses the $4$-tensor back down to a matrix, which is exactly the shape $\frac{\partial\ell}{\partial W}$ should have.
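This contraction is easy to check numerically. The sketch below (using the squared norm $\ell = \|Y\|_F^2$ for simplicity, so $\frac{\partial\ell}{\partial Y}=2Y$; dimensions $D=4$, $N=5$ are arbitrary choices) builds the $4$-tensor $\frac{\partial Y_{cd}}{\partial W_{ab}} = \delta_{ca}X_{bd}$ explicitly and contracts it against the matrix $\frac{\partial\ell}{\partial Y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 5
X = rng.standard_normal((D, N))
W = X @ X.T
Y = W @ X

# dl/dY for the squared Frobenius norm l = ||Y||_F^2
dL_dY = 2 * Y

# 4-tensor dY[c,d]/dW[a,b] = delta(c,a) * X[b,d], shape (D, N, D, D)
dY_dW = np.einsum('ca,bd->cdab', np.eye(D), X)

# The "matrix times 4-tensor" product is really a double contraction
# over the indices c,d -- this collapses the 4-tensor back to a matrix.
dL_dW = np.einsum('cd,cdab->ab', dL_dY, dY_dW)

# Matches the closed form 2*Y*X^T (with X held constant)
assert np.allclose(dL_dW, 2 * Y @ X.T)
print(dL_dW.shape)
```

The contraction never needs to be materialized in practice; `einsum` is used here only to make the index bookkeeping explicit.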

ANSWER

$ \def\l{\lambda}\def\o{{\tt1}}\def\p{\partial} \def\B{\Big}\def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\B(#1\B)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\b#1{\color{blue}{#1}} $By using$\ ^\dagger$Frobenius products and differentials (note that this works with the squared norm $\ell = Y:Y = \|Y\|_F^2$, which avoids the square root), one can calculate the gradient of the scalar function as follows (without the need for those intermediate $4$-tensors) $$\eqalign{ \ell &= Y:Y \\ d\ell &= 2Y:dY \\ &= 2Y:\BR{\c{dW\,X} + \b{W\,dX}} \\ }$$ Now consider the case where $X$ is constant, and therefore $\,\b{dX=0}$ $$\eqalign{ d\ell &= 2Y:\c{dW\,X} \\ &= 2YX^T:dW \\ \grad{\ell}{W} &= 2YX^T \\ }$$ Similarly, when $W$ is constant, $\,\c{dW=0}\,$ and $$\eqalign{ d\ell &= 2Y:\b{W\,dX} \\ &= 2W^TY:dX \\ \grad{\ell}{X} &= 2W^TY \\\\ }$$
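Both closed-form gradients can be sanity-checked against central finite differences. A minimal sketch (treating $W$ and $X$ as independent, as in the two constant-matrix cases above; the helper `num_grad` and the shapes are illustrative choices, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 4
X = rng.standard_normal((D, N))
W = rng.standard_normal((D, D))  # independent of X for this check

def loss(W, X):
    Y = W @ X
    return np.sum(Y * Y)  # l = Y:Y = ||Y||_F^2

def num_grad(f, M, eps=1e-6):
    """Central finite differences of scalar f with respect to matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        G[idx] = (f(Mp) - f(Mm)) / (2 * eps)
    return G

Y = W @ X
grad_W = 2 * Y @ X.T   # derived gradient, X held constant
grad_X = 2 * W.T @ Y   # derived gradient, W held constant

assert np.allclose(grad_W, num_grad(lambda M: loss(M, X), W), atol=1e-4)
assert np.allclose(grad_X, num_grad(lambda M: loss(W, M), X), atol=1e-4)
```

If one also wanted the total gradient with respect to $X$ in the original setup where $W = XX^\top$, the two colored terms would have to be combined via $dW = dX\,X^\top + X\,dX^\top$ rather than holding one matrix constant.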


$\ ^\dagger\ $ The Frobenius product is a concise notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ G &= \grad{\ell}{A} \qiq d\ell = G:dA \\ }$$ This is also called the double-dot or double-contraction product.
When applied to vectors $(n=\o)$ it reduces to the standard dot product.

The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in several different but equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A \;=\; A^TC:B \\ }$$
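These identities are straightforward to verify numerically. The sketch below checks each one with random matrices (the shapes are arbitrary, chosen only so the products are defined):

```python
import numpy as np

rng = np.random.default_rng(2)

def frob(P, Q):
    """Frobenius product P:Q = Tr(P^T Q) = sum_ij P_ij Q_ij."""
    return np.trace(P.T @ Q)

A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

assert np.isclose(frob(A, B), np.sum(A * B))                # elementwise sum
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro')**2)  # A:A = ||A||_F^2
assert np.isclose(frob(A, B), frob(B, A))                   # A:B = B:A
assert np.isclose(frob(A, B), frob(A.T, B.T))               # A:B = A^T:B^T

# C:AB = CB^T:A = A^T C:B  (shapes: A 3x4, B 4x2, C 3x2)
A2 = rng.standard_normal((3, 4))
B2 = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2))
assert np.isclose(frob(C, A2 @ B2), frob(C @ B2.T, A2))
assert np.isclose(frob(C, A2 @ B2), frob(A2.T @ C, B2))
```

Each rearrangement is just the cyclic property of the trace in disguise, e.g. $\operatorname{Tr}\!\left((AB)^TC\right) = \operatorname{Tr}\!\left(B^TA^TC\right)$.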