I have the following net $z=Wa$, where $z$ is a $2\times 1$ column vector, $a$ is a $3\times 1$ column vector, and $W$ is a weight matrix with dimensions $2\times 3$, together with a loss function $l=L(z)$, where $l$ is a scalar.
I'm now trying to calculate $\frac{dL}{dW}$, the derivative of the loss w.r.t. $W$, and I expect the result to have dimensions $2\times 3$. How do I do that using the generalized chain rule?
- I calculated $\frac{dL}{dz}$, which is a $1\times 2$ row vector.
- I calculated $\frac{dz}{dW}$, which is a 3-d array, see below.
My professor told us, without proof, that $\frac{dz}{dW}$ is simply $a$ as a $1\times 3$ row vector, so we can calculate the derivative using a simple outer product: $\frac{dL}{dz}^T\frac{dz}{dW}$. This indeed works out, and looks nice just like the chain rule - but why is it correct? How does my 3-d array below amount simply to $a$?
I'm looking for a formal explanation, along the lines of vector derivatives.
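For concreteness, here is a quick numerical sketch of the recipe (the numbers and the choice of loss $L(z)=\tfrac12\|z\|^2$ are made up purely to check shapes and values):

```python
import numpy as np

# Made-up example; the loss L(z) = 0.5*||z||^2 is chosen only so that
# dL/dz is easy to write down by hand (it is z^T, a 1x2 row vector).
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # weights, 2x3
a = rng.standard_normal((3, 1))   # input, 3x1
z = W @ a                         # 2x1

dL_dz = z.T                       # 1x2 row vector
dL_dW = dL_dz.T @ a.T             # outer product: (2x1)(1x3) -> 2x3

# Finite-difference check of one entry of the claimed gradient:
def L(M):
    v = M @ a
    return 0.5 * float(v.T @ v)

eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
fd = (L(Wp) - L(W)) / eps
print(dL_dW.shape)                   # (2, 3)
print(abs(fd - dL_dW[1, 2]) < 1e-4)  # True
```

The outer product does produce a $2\times 3$ result that matches finite differences, which is what makes the professor's shortcut so puzzling.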

$\def\p{\partial}$ First of all, you are correct: the quantity $\left(\frac{\p z}{\p W}\right)$ is definitely a third-order tensor. The field of machine learning is full of folks (even professors) who use atrocious mathematics.
Second (as you have discovered), it is very difficult to apply the chain rule in matrix calculus because it requires intermediate quantities which are third- and fourth-order tensors. No one cares about these intermediate quantities, because the final result is always a scalar, vector, or matrix.
A simpler approach is to use differentials, because the differential of a matrix has the same dimensionality as the matrix, and it obeys all of the rules of matrix algebra.
One piece of notation that makes things less error-prone (especially with respect to transposes) is the trace/Frobenius product (denoted by a colon) between two equally-sized matrices, i.e. $$A:B = {\rm tr}(A^TB) = \sum_i\sum_j A_{ij}B_{ij}$$ The properties of the trace allow such products to be rearranged in many equivalent ways, e.g. $$\eqalign{ A:B &= B:A = B^T:A^T \\ CA:B &= A:C^TB = C:BA^T \\ }$$ This product also works for vectors (simply think of them as rectangular matrices) $$a:b = {\rm tr}(a^Tb) = a^Tb$$
Assume we have a quadratic loss function $$L = \tfrac 12 (z-\hat z):(z-\hat z)$$ and wish to calculate its gradient with respect to $W$. Start by calculating the differential (noting that the target $\hat z$ is constant, so $d\hat z=0$) $$\eqalign{ dL &= \tfrac 12d(z-\hat z):(z-\hat z)+\tfrac 12(z-\hat z):d(z-\hat z) \\ &= (z-\hat z):d(z-\hat z) \\ &= (z-\hat z):dz \\ &= (z-\hat z):dW\,a \\ &= (z-\hat z)a^T:dW \\ }$$ Now the gradient is easily identified as $$\eqalign{ \frac{\p L}{\p W} &= (z-\hat z)a^T \\ }$$ But now you understand the origin of the $a^T$ term: it comes from rearranging a Frobenius product. It is definitely not equal to $\left(\frac{\p z}{\p W}\right)^T$.
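The Frobenius-product rules above, including the key rearrangement $r:(dW\,a) = (ra^T):dW$ that produces the $a^T$ term, can be checked numerically (all matrices below are random, made-up examples):

```python
import numpy as np

rng = np.random.default_rng(1)

def frob(A, B):
    return np.trace(A.T @ B)          # A:B = tr(A^T B)

A, B = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))
C = rng.standard_normal((2, 2))

assert np.isclose(frob(A, B), frob(B, A))            # A:B  = B:A
assert np.isclose(frob(A, B), frob(B.T, A.T))        # A:B  = B^T:A^T
assert np.isclose(frob(C @ A, B), frob(A, C.T @ B))  # CA:B = A:C^T B
assert np.isclose(frob(C @ A, B), frob(C, B @ A.T))  # CA:B = C:BA^T

# The step that produces a^T:  r:(dW a) = (r a^T):dW
r  = rng.standard_normal((2, 1))
dW = rng.standard_normal((2, 3))
a  = rng.standard_normal((3, 1))
assert np.isclose(frob(r, dW @ a), frob(r @ a.T, dW))
print("all identities hold")
```

Each assertion is just the cyclic property of the trace in disguise, e.g. ${\rm tr}(r^T\,dW\,a) = {\rm tr}(a\,r^T\,dW) = {\rm tr}((ra^T)^T dW)$.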
Notice that this gradient has the same dimensions as the $W$ matrix, exactly as you anticipated. This is due to the presence of the $dW$ factor on the RHS of the Frobenius product, which has the same dimensions as $W$.
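A finite-difference check (again with made-up numbers) confirms both the shape and the value of the gradient $(z-\hat z)a^T$ for the quadratic loss:

```python
import numpy as np

# Verify dL/dW = (z - zhat) a^T for L = 0.5*(z - zhat):(z - zhat), z = W a.
rng = np.random.default_rng(2)
W    = rng.standard_normal((2, 3))
a    = rng.standard_normal((3, 1))
zhat = rng.standard_normal((2, 1))

def L(M):
    r = M @ a - zhat
    return 0.5 * float(r.T @ r)

grad = (W @ a - zhat) @ a.T          # the claimed gradient
assert grad.shape == W.shape         # same dimensions as W, as expected

# Compare every entry against a one-sided finite difference:
eps = 1e-6
for i in range(2):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        assert abs((L(Wp) - L(W)) / eps - grad[i, j]) < 1e-4
print("gradient matches finite differences")
```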
The standard text for this stuff is *Matrix Differential Calculus* by Magnus & Neudecker.