Actual operator used in chain rule for matrix derivative

In the scalar case, the chain rule for $Y = f(g(X))$ can be written as
$\frac{dY}{dX} = \frac{dY}{dU}\frac{dU}{dX}$
where $U = g(X)$.

When $U$ and $X$ are $M \times N$ and $P \times Q$ matrices respectively,
$\frac{dY}{dU}$ and $\frac{dU}{dX}$ are $M \times N$ and $MN \times PQ$ matrices respectively.
Here I see an issue:
the chain rule above doesn't make sense,
because an $M \times N$ matrix and an $MN \times PQ$ matrix can neither be multiplied nor combined in a dot product, given their sizes.

What am I missing or misunderstanding? I guess the division in $\frac{dY}{dU}$ and/or the multiplication of $\frac{dY}{dU}$ and $\frac{dU}{dX}$
should be replaced with other operators, but I'm not sure what they are.

Note: I've read many articles on matrix derivatives, but AFAIK they only cover the vector case ($N=1$ and/or $Q=1$).

There is 1 best solution below

You need to use the double-dot product

$$\frac{dY}{dX} \;=\; \left(\frac{dY}{dU}\right):\left(\frac{dU}{dX}\right)$$

which can also be written using index notation

$$\frac{dY_{ij}}{dX_{k\ell}} \;=\; \sum_{m=1}^M\sum_{n=1}^N \left(\frac{dY_{ij}}{dU_{mn}}\right) \left(\frac{dU_{mn}}{dX_{k\ell}}\right)$$

NB: All of the gradients above are fourth-order tensors.
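To check the double-dot contraction numerically, here is a minimal NumPy sketch. The maps $U = AXB$ and $Y = CU$, the matrices `A`, `B`, `C`, and the extra dimension `R` are assumptions chosen purely for illustration; they are not part of the question. With these linear maps the gradients are known in closed form, so the contraction over $m, n$ can be compared against the direct derivative of $Y = CAXB$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P, Q, R = 3, 4, 2, 5, 6  # illustrative sizes (R is an assumed extra dimension)

# Assumed example maps: U = A X B (M x N), Y = C U (R x N), with X of shape P x Q
A = rng.standard_normal((M, P))
B = rng.standard_normal((Q, N))
C = rng.standard_normal((R, M))

# dU_mn / dX_kl = A_mk * B_ln  -> fourth-order tensor of shape (M, N, P, Q)
dU_dX = np.einsum("mk,ln->mnkl", A, B)

# dY_ij / dU_mn = C_im * delta_jn  -> fourth-order tensor of shape (R, N, M, N)
dY_dU = np.einsum("im,jn->ijmn", C, np.eye(N))

# Double-dot product: contract over the shared index pair (m, n)
dY_dX = np.einsum("ijmn,mnkl->ijkl", dY_dU, dU_dX)

# Direct derivative of Y = (C A) X B: dY_ij / dX_kl = (C A)_ik * B_lj
expected = np.einsum("ik,lj->ijkl", C @ A, B)
chain_rule_holds = np.allclose(dY_dX, expected)

# The same contraction as an ordinary matrix product after flattening index pairs
flat = dY_dU.reshape(R * N, M * N) @ dU_dX.reshape(M * N, P * Q)
matches_flat = np.allclose(flat.reshape(R, N, P, Q), dY_dX)
```

The last two lines show why the flattened (vectorized) Jacobians can be multiplied as ordinary matrices: grouping $(i,j)$, $(m,n)$ and $(k,\ell)$ into single indices turns the double-dot product into a product of an $RN \times MN$ matrix with an $MN \times PQ$ matrix.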