I've been trying to perform the following differentiation of a neural network:
$$\frac{\partial||h(XW)\alpha-y||^2}{\partial W} = \frac{\partial}{\partial W}\sum_i(h(XW)_i\alpha-y_i)^2$$
where $X$ and $W$ are matrices, $\alpha$ and $y$ are vectors, and $h$ is a function applied pointwise.
I've been reading the Wikipedia article on matrix calculus and "The Matrix Cookbook" all day, but I can't seem to make it work. I think it should be something like
$$2(h(XW)\alpha-y)^T\frac{\partial}{\partial W}(h(XW)\alpha)$$
But I get stuck at the $h$ function, which I suppose maps a matrix to a matrix.
Any hints would be appreciated.
Update: I think this derivation is correct:
$$\frac{\partial||h(XW)\alpha-y||^2}{\partial W} = \frac{\partial}{\partial W}(h(XW)\alpha-y)^T(h(XW)\alpha-y) = 2X^T\big(h'(XW)\odot\big((h(XW)\alpha-y)\alpha^T\big)\big)$$
This was derived by differentiating with respect to each element of $W$ using traces.
Update 2: I found this great presentation on the topic: Schonemann_Trace_Derivatives_Presentation.pdf, which I highly recommend.
I've reformulated by defining $H=h(XW)$, $H'=h'(XW)$, and $E = HA-Y$, where $A$ and $Y$ are the matrix generalizations of $\alpha$ and $y$. The problem then has the pretty solution
$$\frac{\partial}{\partial W}||E||_F^2 = \frac{\partial}{\partial W}\operatorname{Tr}(EE^T) = 2X^T(H' \odot EA^T)$$
where $\odot$ is the pointwise (Hadamard) product. By working further with the trace manipulations you can also get a formula for the generalized problem $h(h(h(\dots)W_2)W_1)W_0$.
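As a sanity check, the closed-form gradient $2X^T(H' \odot EA^T)$ can be compared against a finite-difference approximation. Here is a minimal sketch, assuming $h=\tanh$ and small random matrices (both are illustrative choices, not part of the original problem):

```python
import numpy as np

# Verify d/dW ||h(XW)A - Y||_F^2 = 2 X^T (h'(XW) * ((h(XW)A - Y) A^T))
# against a central finite-difference approximation.
# h = tanh and the random shapes below are assumptions for illustration.
rng = np.random.default_rng(0)
n, d, k, m = 4, 3, 5, 2          # X: n x d, W: d x k, A: k x m, Y: n x m
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
A = rng.standard_normal((k, m))
Y = rng.standard_normal((n, m))

h  = np.tanh
hp = lambda z: 1.0 - np.tanh(z) ** 2   # pointwise derivative of tanh

def loss(W):
    return np.sum((h(X @ W) @ A - Y) ** 2)

# Closed-form gradient from the trace derivation
E = h(X @ W) @ A - Y
grad = 2.0 * X.T @ (hp(X @ W) * (E @ A.T))

# Finite-difference check, entry by entry
eps = 1e-6
fd = np.zeros_like(W)
for i in range(d):
    for j in range(k):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # should be negligibly small
```

Note how the shapes line up: $h'(XW)\odot(EA^T)$ lives in the same space as $XW$, and left-multiplying by $X^T$ brings it back to the shape of $W$.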
Thomas, here there are no traces. I assume that $h:M_n\rightarrow M_n$ and that $||\cdot||$ is the Euclidean norm on $\mathbb{R}^n$. The derivative of $h$ at the point $XW$ is denoted by $D_{XW}h\in L(M_n,M_n)$; it is a Jacobian. If $f:W\mapsto ||h(XW)\alpha-y||^2$, then $D_{W}f:H\in M_n\rightarrow \mathbb{R}$ is a linear map. If $g:W\mapsto h(XW)\alpha-y$, then $D_Wg:H\mapsto D_{XW}h(XH)\alpha$. Finally, the required derivative $D_Wf:H\mapsto 2(h(XW)\alpha-y)^TD_{XW}h(XH)\alpha$ does not have a nice closed form.
$f$ is a real-valued function and therefore has a gradient that (in theory) we could write using the matrix scalar product $(U,V)=\operatorname{trace}(U^TV)$. Unfortunately, "trace" does not appear in the formula giving $D_Wf$. I'm afraid you have to settle for the previous result.
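The directional-derivative formula above can also be checked numerically. For a pointwise $h$, the Jacobian acts as $D_{XW}h(XH) = h'(XW)\odot(XH)$, so $D_Wf(H) = 2(h(XW)\alpha-y)^T\big(h'(XW)\odot(XH)\big)\alpha$. A quick sketch, again assuming $h=\tanh$ and random data for illustration:

```python
import numpy as np

# Check D_W f(H) = 2 (h(XW)a - y)^T (h'(XW) * (XH)) a, where the Jacobian
# of a pointwise h acts as a Hadamard product. h = tanh is an assumption.
rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n))
W = rng.standard_normal((n, n))
H = rng.standard_normal((n, n))      # direction of perturbation
a = rng.standard_normal(n)
y = rng.standard_normal(n)

h  = np.tanh
hp = lambda z: 1.0 - np.tanh(z) ** 2

f = lambda W: np.sum((h(X @ W) @ a - y) ** 2)

# Directional derivative from the formula (pointwise case)
r = h(X @ W) @ a - y
Df_H = 2.0 * r @ ((hp(X @ W) * (X @ H)) @ a)

# Central finite difference of f along the direction H
t = 1e-6
fd = (f(W + t * H) - f(W - t * H)) / (2 * t)
print(abs(Df_H - fd))  # should be negligibly small
```

This confirms that, even though $D_Wf$ has no tidy trace expression in general, it is perfectly computable as a linear functional of the direction $H$.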