Differentiate vector norm by matrix


I've been trying to perform the following differentiation of a neural network:

$$\frac{\partial\,||h(XW)\alpha-y||^2}{\partial W} = \frac{\partial}{\partial W}\sum_i(h(XW)_i\alpha-y_i)^2$$

Where $X$ and $W$ are matrices, $\alpha$ and $y$ are vectors, and $h$ is a function applied pointwise.
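For concreteness, the objective can be written down directly. A minimal NumPy sketch, where the shapes and the choice $h=\tanh$ are illustrative assumptions, not part of the original problem:

```python
import numpy as np

# Hypothetical shapes: X is n x d, W is d x k, alpha has length k, y length n.
rng = np.random.default_rng(0)
n, d, k = 5, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
alpha = rng.standard_normal(k)
y = rng.standard_normal(n)

h = np.tanh  # stand-in for any pointwise (elementwise) function

def loss(W):
    # ||h(XW) alpha - y||^2 as a sum of squared residuals
    r = h(X @ W) @ alpha - y
    return r @ r

print(loss(W))
```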

I've been reading the Wikipedia article on Matrix calculus and "The Matrix Cookbook" all day, but I can't seem to get things to work. I think it should probably be

$$2(h(XW)\alpha-y)\frac{\partial}{\partial W}(h(XW)\alpha)$$

But I get stuck at the $h$ function, which I suppose you could say maps matrices to matrices.

Any hints would be appreciated.

Update: I think this derivation is correct:

$$\frac{\partial||h(XW)\alpha-y||^2}{\partial W} = \frac{\partial}{\partial W}(h(XW)\alpha-y)^T(h(XW)\alpha-y) = 2(h(XW)\alpha-y)^T\frac{\partial}{\partial W}(h(XW)\alpha) = 2(h(XW)\alpha-y)^TX^Th'(XW)\alpha$$

This was derived by differentiating with respect to each element of $W$ using traces.


Update 2: I found this great presentation of the topic, Schonemann_Trace_Derivatives_Presentation.pdf, which I highly recommend.

I've reformulated by defining $H=h(XW)$, $H'=h'(XW)$, $E = HA-Y$. Hence the problem has the pretty solution

$$\frac{\partial}{\partial W}||E||_F^2 = \frac{\partial}{\partial W}\operatorname{Tr}(EE^T) = 2X^T(H' \odot EA^T)$$

Where $\odot$ is the pointwise (Hadamard) product. By working further with the trace manipulations you can also get a formula for the generalized problem $h(h(h(\cdots)W_2)W_1)W_0$.
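The formula in this second update can be checked numerically against a finite-difference gradient. A sketch assuming $h=\tanh$ and treating $E = h(XW)\alpha - y$ as a vector, so that $EA^T$ becomes the outer product with $\alpha$ (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
alpha = rng.standard_normal(k)
y = rng.standard_normal(n)

# h and its elementwise derivative (tanh is an illustrative choice)
h = np.tanh
dh = lambda z: 1.0 - np.tanh(z) ** 2

def loss(W):
    r = h(X @ W) @ alpha - y
    return r @ r

# Closed-form gradient 2 X^T (H' ⊙ E A^T); with vector E and alpha,
# E A^T is the outer product np.outer(E, alpha).
H_prime = dh(X @ W)
E = h(X @ W) @ alpha - y
grad = 2.0 * X.T @ (H_prime * np.outer(E, alpha))

# Central finite differences over every entry of W
eps = 1e-6
num = np.zeros_like(W)
for i in range(d):
    for j in range(k):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(grad - num)))  # should be tiny
```

The two gradients agree to within finite-difference error, which is a quick way to catch a missing transpose or a misplaced Hadamard product.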



Thomas, here there are no traces. I assume that $h:M_n\rightarrow M_n$ and $||.||$ is the Euclidean norm on $\mathbb{R}^n$. The derivative of $h$ at the point $XW$ is denoted by $D_{XW}h\in L(M_n,M_n)$; its matrix representation is a Jacobian. If $f:W\rightarrow ||h(XW)\alpha-y||^2$, then $D_{W}f:H\in M_n\rightarrow \mathbb{R}$ is a linear map. If $g:W\rightarrow h(XW)\alpha-y$, then $D_Wg:H\rightarrow D_{XW}h(XH)\alpha$. Finally the required derivative $D_Wf:H\rightarrow 2(h(XW)\alpha-y)^T(D_{XW}h)(XH)\alpha$ does not have a nice closed form.

$f$ is a real-valued function and therefore has a gradient that (in principle) we can write using the matrix scalar product $(U,V)=\operatorname{trace}(U^TV)$. Unfortunately, "trace" does not appear naturally in the formula giving $D_Wf$. I'm afraid you have to settle for the previous result.
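The linear map $D_Wf$ described in this answer can still be sanity-checked against a finite-difference directional derivative: for a pointwise $h$, $D_{XW}h$ acts as the Hadamard product with $h'(XW)$. A sketch, where the rectangular shapes and $h=\tanh$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 5, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, k))
H = rng.standard_normal((d, k))  # an arbitrary direction in matrix space
alpha = rng.standard_normal(k)
y = rng.standard_normal(n)

h = np.tanh
dh = lambda z: 1.0 - np.tanh(z) ** 2

def f(W):
    r = h(X @ W) @ alpha - y
    return r @ r

# The answer's linear map: D_W f(H) = 2 (h(XW)alpha - y)^T [(h'(XW) ⊙ (XH)) alpha],
# since for a pointwise h the derivative D_{XW}h is Hadamard multiplication by h'(XW).
r = h(X @ W) @ alpha - y
Df_H = 2.0 * r @ ((dh(X @ W) * (X @ H)) @ alpha)

# Compare against a central finite-difference directional derivative
t = 1e-6
fd = (f(W + t * H) - f(W - t * H)) / (2 * t)
print(abs(Df_H - fd))  # should be tiny
```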


In its second updated form, the function and its differential can be written in terms of the Frobenius product as
$$\eqalign{
f &= E:E \cr
df &= 2\,E:dE \cr
&= 2\,E:dH\,A \cr
&= 2\,EA^T:dH \cr
&= 2\,EA^T:H'\circ d(XW) \cr
&= 2\,EA^T\circ H':X\,dW \cr
&= 2\,X^T(EA^T\circ H'):dW \cr
}$$
yielding the derivative
$$\frac{\partial f}{\partial W} = 2\,X^T(EA^T\circ H')$$
The Hadamard product is commutative, so this equals Thomas' result.
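The two moves used repeatedly in this chain — sliding a Hadamard factor across the Frobenius product, and moving a matrix factor to the other side via a transpose — can be verified numerically. A small sketch with arbitrary matrices (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 3
A = rng.standard_normal((n, k))
B = rng.standard_normal((n, k))
M = rng.standard_normal((n, k))
C = rng.standard_normal((n, n))

def frob(U, V):
    # Frobenius product U:V = Tr(U^T V) = elementwise sum of U*V
    return np.sum(U * V)

# Hadamard factors slide across the Frobenius product: A : (B ∘ M) = (A ∘ B) : M
print(np.isclose(frob(A, B * M), frob(A * B, M)))
# Matrix factors move by transposition: A : CM = (C^T A) : M
print(np.isclose(frob(A, C @ M), frob(C.T @ A, M)))
```

These two identities are exactly what carry the differential from $2\,E:dE$ down to $2\,X^T(EA^T\circ H'):dW$.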