We have two matrices $\mathbf{W_1}$ and $\mathbf{W_2}$, and a column vector $\mathbf{h}$:
$$ \mathbf{W_1} = \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix} \;\;\;\;\;\;\;\;\; \mathbf{W_2} = \begin{bmatrix} e & f \\ \end{bmatrix} \;\;\;\;\;\;\;\;\; \mathbf{h} = \begin{bmatrix} h_1 \\ h_2 \\ \end{bmatrix} $$
And a scalar $y$, where:
$$ y = \mathbf{W_2} \mathbf{W_1} \mathbf{h} $$
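(As a quick sanity check on the shapes, here's a sketch in NumPy with arbitrary made-up values for $a,\dots,f$ and $h_1, h_2$ — a $1 \times 2$ times $2 \times 2$ times $2 \times 1$ product collapses to a scalar:)

```python
import numpy as np

# Arbitrary made-up values for the symbols above.
W1 = np.array([[1.0, 4.0],   # [[a, b],
               [5.0, 6.0]])  #  [c, d]]  -- 2x2
W2 = np.array([[2.0, 3.0]])  # [[e, f]]  -- 1x2
h = np.array([[0.5],
              [1.5]])        # [[h1], [h2]]  -- 2x1 column vector

y = W2 @ W1 @ h              # (1x2)(2x2)(2x1) -> 1x1, i.e. a scalar
print(y.shape)               # (1, 1)
```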
I'd like to compute the derivative of $y$ with respect to $\mathbf{W_1}$, assuming numerator layout.
Using the chain rule:
$$ y = \mathbf{W_2} \mathbf{u} \;\;\;\;\;\;\;\;\; \mathbf{u} = \mathbf{W_1} \mathbf{h} $$
$$ \begin{align} \frac{\partial y}{\partial \mathbf{W_1}} &= \frac{\partial y}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\ &= \mathbf{W_2} \frac{\partial \mathbf{u}}{\partial \mathbf{W_1}} \\ &= \mathbf{W_2} \mathbf{h}^{\top} \\ \end{align} $$
All well and good. Except this isn't a $2 \times 2$ matrix! In fact, the dimensions don't even match up for matrix multiplication ($\mathbf{W_2}$ is $1 \times 2$ and $\mathbf{h}^\top$ is $1 \times 2$), so something must be incorrect.
If we take the Wikipedia definition of the derivative of a scalar by a matrix, using numerator layout, we know that actually:
$$ \frac{\partial y}{\partial \mathbf{W_1}} = \begin{bmatrix} \frac{\partial y}{\partial a} & \frac{\partial y}{\partial c} \\ \frac{\partial y}{\partial b} & \frac{\partial y}{\partial d} \\ \end{bmatrix} $$
Each element is just a scalar derivative, which we can calculate without any matrix calculus. Writing out $y = e(a h_1 + b h_2) + f(c h_1 + d h_2)$, differentiating element by element, and then factorising, we end up with:
$$ \frac{\partial y}{\partial \mathbf{W_1}} = \mathbf{h} \mathbf{W_2} $$
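(A finite-difference sketch in NumPy, with made-up concrete values, confirms that the numerator-layout gradient really is $\mathbf{h} \mathbf{W_2}$:)

```python
import numpy as np

# Made-up concrete values for the symbols above (any numbers work).
W1 = np.array([[1.0, 4.0],
               [5.0, 6.0]])  # 2x2, entries a, b, c, d
W2 = np.array([[2.0, 3.0]])  # 1x2, entries e, f
h = np.array([[0.5],
              [1.5]])        # 2x1 column vector

def y(W1):
    return (W2 @ W1 @ h).item()   # scalar

# Finite-difference gradient in numerator layout:
# entry (i, j) of the result is dy/d(W1[j, i]).
eps = 1e-6
grad = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        W1p = W1.copy()
        W1p[j, i] += eps
        grad[i, j] = (y(W1p) - y(W1)) / eps

print(np.allclose(grad, h @ W2, atol=1e-4))   # True: matches h W2
```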
Clearly, $\mathbf{h} \mathbf{W_2} \neq \mathbf{W_2} \mathbf{h}^\top $.
Can anybody suggest where I went wrong?
For $\partial \mathbf{u} / \partial \mathbf{W_1}$: $\mathbf{u}$ is a $2 \times 1$ vector and $\mathbf{W_1}$ is a $2 \times 2$ matrix, so $\partial \mathbf{u} / \partial \mathbf{W_1}$ is a $2 \times 2 \times 2$ tensor, not $\mathbf{h}^\top$.
Ref: https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions
> Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other unfilled cells in our table. However, these derivatives are most naturally organized in a tensor of rank higher than 2, so that they do not fit neatly into a matrix.
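(To make this concrete, here's a NumPy sketch with made-up values: build the rank-3 tensor $\partial \mathbf{u} / \partial \mathbf{W_1}$ explicitly, contract it with $\mathbf{W_2}$, and the result agrees with $\mathbf{h} \mathbf{W_2}$ once arranged in numerator layout.)

```python
import numpy as np

# Made-up concrete values for the symbols in the question.
W2 = np.array([[2.0, 3.0]])  # 1x2, entries e, f
h = np.array([[0.5],
              [1.5]])        # 2x1 column vector

# du/dW1 is a rank-3 tensor: since u_i = sum_k W1[i, k] h[k],
# d u_i / d W1[j, k] = delta_{ij} * h[k].
T = np.zeros((2, 2, 2))
for i in range(2):
    for k in range(2):
        T[i, i, k] = h[k, 0]

# Contracting W2 with the tensor along u's index gives
# dy/dW1[j, k] = sum_i W2[0, i] * T[i, j, k] = W2[0, j] * h[k].
grad_jk = np.einsum('i,ijk->jk', W2[0], T)

# In numerator layout the (i, j) entry is dy/dW1[j, i], i.e. the
# transpose of grad_jk -- which is exactly h @ W2.
print(np.allclose(grad_jk.T, h @ W2))  # True
```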