Consider a loss function $$ j = \frac{1}{2}\|e\|^2,$$ where $e=y-t$, $y=f(x,u) \in \mathbb{R}^{n}$, $u=Wx \in \mathbb{R}^{m}$, and $x\in \mathbb{R}^n$.
I want to find the gradient of $j$ with respect to the $m\times n$ matrix $W$, i.e. $\frac{\partial j}{\partial W}$, in order to minimize $j$.
Let us denote the Frobenius inner product by $A:B = \operatorname{tr}(A^TB)$. We can then write $$\begin{aligned} j & = \frac{1}{2}e:e, \\ dj & = e : de \\ & = (y-t):dy \\ & = (y-t):df(x,u). \end{aligned}$$
This is where I get confused: I want to write $df(x,u)$ in terms of the gradient $\frac{\partial f(x,u)}{\partial u}$ and $du$, but I cannot say that $df(x,u)$ equals $\frac{\partial f(x,u)}{\partial u}:du$, since $f(x,u)$ is not a scalar!
So how do I write out the differential of a vector-valued function such as $f(x,u)$? Once I have this term, I intend to proceed with the chain rule in order to find $du$: $$ \begin{aligned} u & = Wx,\\ du & = d(Wx) = dW\,x. \end{aligned} $$ I would then expect $df(x,u)$ to be a function of the gradient $\frac{\partial f(x,u)}{\partial u}$ and $dW$, so that I can find the final gradient $\frac{\partial j}{\partial W}$.
In short: what is the correct way to apply the chain rule to $df(x,u)$ with respect to $W$, and are there general rules concerning the Frobenius inner product and the chain rule that I missed?
For typing convenience, let
$$\eqalign{ G &= \frac{\partial f}{\partial u} \in {\mathbb R}^{n\times m} \cr }$$
denote the Jacobian of $f$ with respect to $u$. The differential of a vector-valued function is the ordinary matrix-vector product of its Jacobian with $du$, not a Frobenius product. Since $x$ is held fixed, the differentials are
$$\eqalign{ df &= G\,du \cr du &= dW\,x \cr }$$

Then, picking up where you got stuck,
$$\eqalign{ dj &= (y-t):df \cr &= e:df \cr &= e:G\,du \cr &= G^Te:du \cr &= G^Te:dW\,x \cr &= G^Tex^T:dW \cr \frac{\partial j}{\partial W} &= G^Tex^T \cr }$$
As a dimension check, $G^T$ is $m\times n$, $e$ is $n\times 1$ and $x^T$ is $1\times n$, so the gradient is $m\times n$, the same shape as $W$.

There are rules for manipulating a mixture of Frobenius products and other products
$$\eqalign{ A:BC &= B^TA:C &= AC^T:B &\,\,\,\text{ (Frobenius-Matrix)}\cr A:B\odot C &= B\odot A:C &= A\odot C:B &\,\,\,\text{ (Frobenius-Hadamard)}\cr }$$

There are also rules for Kronecker product mixtures
$$\eqalign{ (A\otimes B)(X\otimes Y) &= (AX)\otimes(BY) &\,\,\,\text{ (Kronecker-Matrix)}\cr (A\otimes B):(X\otimes Y) &= (A:X)\otimes(B:Y) &\,\,\,\text{ (Kronecker-Frobenius)}\cr (A\otimes B)\odot(X\otimes Y) &= (A\odot X)\otimes(B\odot Y) &\,\,\,\text{ (Kronecker-Hadamard)}\cr }$$

and rules for functions commonly used in matrix decompositions and transforms
$$\eqalign{ \def\op{\operatorname} \def\tr{\op{tr}} \def\vc{\op{vec}} \def\sym{\op{sym}} \def\skew{\op{skew}} \def\iso{\op{iso}} \def\dev{\op{dev}} \def\fft{\op{FFT}} \sym(A) &\equiv\tfrac12(A+A^T) \qquad&\skew(A)\equiv A-\sym(A) \\ \iso(A) &\equiv\left[\frac{\tr(A)}{\tr(I)}\right]I \qquad&\;\;\dev(A)\equiv A-\iso(A) \\ \\ \sym(A):B &= A:\sym(B) \qquad&\sym(A):\skew(B) = 0 \\ \skew(A):B &= A:\skew(B) \\ \iso(A):B &= A:\iso(B) \qquad&\;\;\iso(A):\dev(B) = 0 \\ \dev(A):B &= A:\dev(B) \\ \fft(A):B &= A:\fft(B) \\ A:B &= \vc(A):\vc(B) \\ }$$
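If you want to convince yourself numerically, the result $\frac{\partial j}{\partial W} = G^Tex^T$ and the Frobenius-Matrix rule can be checked with a few lines of NumPy. This is only a sketch under assumptions that are not part of the question: the concrete choice $f(x,u)=\tanh(Au)+x$, the matrix `A`, the sizes `n` and `m`, and all helper names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3  # illustrative sizes: x, y, t in R^n and u in R^m

# Hypothetical problem data (not from the question)
A = rng.standard_normal((n, m))  # parameter of the assumed f
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
t = rng.standard_normal(n)

def f(x, u):
    # Assumed example function f: R^n x R^m -> R^n
    return np.tanh(A @ u) + x

def jacobian_u(x, u):
    # G = df/du for the assumed f, shape (n, m): diag(1 - tanh^2(Au)) A
    return (1.0 - np.tanh(A @ u) ** 2)[:, None] * A

def loss(W):
    e = f(x, W @ x) - t
    return 0.5 * e @ e

def analytic_grad(W):
    # dj/dW = G^T e x^T, shape (m, n)
    u = W @ x
    e = f(x, u) - t
    G = jacobian_u(x, u)
    return np.outer(G.T @ e, x)

def numeric_grad(W, h=1e-6):
    # Central finite differences, one entry of W at a time
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for k in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, k] = h
            grad[i, k] = (loss(W + E) - loss(W - E)) / (2 * h)
    return grad

def frob(P, Q):
    # Frobenius product P:Q = tr(P^T Q) for real matrices
    return np.sum(P * Q)

grad_err = np.max(np.abs(analytic_grad(W) - numeric_grad(W)))
print("max |analytic - numeric| =", grad_err)

# Frobenius-Matrix rule: P:BC = B^T P:C = P C^T:B
P = rng.standard_normal((n, m))
B = rng.standard_normal((n, 5))
C = rng.standard_normal((5, m))
lhs = frob(P, B @ C)
mid = frob(B.T @ P, C)
rhs = frob(P @ C.T, B)
print("P:BC, B^T P:C, P C^T:B =", lhs, mid, rhs)
```

The two gradients should agree up to finite-difference error, and the three Frobenius products should agree up to rounding. For a different $f$ you would only need to swap `f` and `jacobian_u`.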