In the scalar case, the chain rule for $Y = f(g(X))$ can be written as
$\frac{dY}{dX} = \frac{dY}{dU}\frac{dU}{dX}$
where $U = g(X)$.
If $U$ and $X$ are $M \times N$ and $P \times Q$ matrices respectively,
then $\frac{dY}{dU}$ and $\frac{dU}{dX}$ are $M \times N$ and $MN \times PQ$ matrices respectively.
Here I see an issue.
The chain rule above doesn't make sense,
because an $M \times N$ matrix and an $MN \times PQ$ matrix can be neither multiplied as matrices nor combined in a dot product, since their sizes are incompatible.
What am I missing or misunderstanding?
I guess the division in $\frac{dY}{dU}$ and/or the multiplication of $\frac{dY}{dU}$ and $\frac{dU}{dX}$
should be replaced with other operators, but I'm not sure what they are.
Note: I've read many articles on matrix derivatives, but AFAIK they only cover the vector case ($N=1$ and/or $Q=1$).
You need to use the double-dot product
$$ \frac{dY}{dX} \;=\; \left(\frac{dY}{dU}\right):\left(\frac{dU}{dX}\right) $$
which can also be written using index notation
$$\frac{dY_{ij}}{dX_{k\ell}} \;=\; \sum_{m=1}^M\sum_{n=1}^N\left(\frac{dY_{ij}}{dU_{mn}}\right) \left(\frac{dU_{mn}}{dX_{k\ell}}\right) $$
NB: All of the gradients above are fourth-order tensors.
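As a sanity check, the double-dot chain rule can be verified numerically. The sketch below is only an illustration under my own assumptions: the functions `g` and `f`, the matrices `A` and `B`, the sizes, and the helper `numerical_jacobian` are all made up for the example and do not come from the question. It builds the fourth-order tensors by central finite differences, contracts them over the shared indices $m, n$ with `np.einsum`, and compares the result with the derivative of the composite function.

```python
import numpy as np

def numerical_jacobian(func, X, eps=1e-6):
    """Approximate J[i, j, k, l] = d func(X)[i, j] / d X[k, l] by central differences."""
    Y = func(X)
    J = np.zeros(Y.shape + X.shape)
    for k in range(X.shape[0]):
        for l in range(X.shape[1]):
            dX = np.zeros_like(X)
            dX[k, l] = eps
            J[:, :, k, l] = (func(X + dX) - func(X - dX)) / (2 * eps)
    return J

# Hypothetical sizes and test functions, chosen only for illustration.
rng = np.random.default_rng(0)
P, Q, M, N = 2, 3, 4, 5
A = rng.standard_normal((M, P))
B = rng.standard_normal((Q, N))
X = rng.standard_normal((P, Q))

def g(X):
    return np.tanh(A @ X @ B)   # U = g(X), an M x N matrix

def f(U):
    return U.T @ U              # Y = f(U), an N x N matrix

U = g(X)
dY_dU = numerical_jacobian(f, U)   # shape (N, N, M, N)
dU_dX = numerical_jacobian(g, X)   # shape (M, N, P, Q)

# Double-dot product: sum over the two shared indices m and n.
dY_dX_chain = np.einsum('ijmn,mnkl->ijkl', dY_dU, dU_dX)

# Direct derivative of the composite function, for comparison.
dY_dX_direct = numerical_jacobian(lambda Z: f(g(Z)), X)
max_abs_error = np.max(np.abs(dY_dX_chain - dY_dX_direct))

# Same contraction as an ordinary matrix product of flattened Jacobians.
J_YU = dY_dU.reshape(N * N, M * N)
J_UX = dU_dX.reshape(M * N, P * Q)
dY_dX_matrix = (J_YU @ J_UX).reshape(N, N, P, Q)

print(dY_dX_chain.shape, max_abs_error)
```

The last few lines show how this relates to the $MN \times PQ$ picture in the question: if each index pair is flattened consistently (row-major here, which is a choice made for this sketch), the double-dot product becomes an ordinary product of an $N^2 \times MN$ matrix with an $MN \times PQ$ matrix. `np.tensordot(dY_dU, dU_dX, axes=2)` performs the same contraction as the `einsum` call.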