Multivariable chain rule: how to take this derivative with respect to a matrix?


I have a simple model and I want to update the parameters using a gradient descent algorithm, so I must find the derivatives of my loss with respect to the parameters. Here is my model:

$$s = Wx + b$$ $$a = \max(0, s)$$ $$t = Ma + c$$ $$f = \tfrac{1}{2}\sum_i (t_i - y_i)^2$$

Where: $x$ is a vector of length $n$; $b$, $s$, and $a$ are vectors of length $m$; $W$ has size $(m, n)$; $M$ has size $(p, m)$; $t$, $y$, and $c$ are vectors of length $p$; and $f$ is a real-valued (scalar-output) function. I am minimizing $f$; this is my loss function.
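For concreteness, the forward pass of this model can be sketched in NumPy (the sizes and random values below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sizes, not from the question.
n, m, p = 4, 3, 2
rng = np.random.default_rng(0)

x = rng.normal(size=n)          # input, length n
W = rng.normal(size=(m, n))     # parameter, size (m, n)
b = rng.normal(size=m)          # parameter, length m
M = rng.normal(size=(p, m))     # parameter, size (p, m)
c = rng.normal(size=p)          # parameter, length p
y = rng.normal(size=p)          # target, length p

s = W @ x + b                   # s = Wx + b
a = np.maximum(0.0, s)          # a = max(0, s), element-wise
t = M @ a + c                   # t = Ma + c
f = 0.5 * np.sum((t - y) ** 2)  # scalar loss
```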

I need to update the parameter $W$, so I need something like "${df \over dW}$", whatever that quantity is. I suppose it must be a matrix of the same size as $W$, so I can perform the update $W \leftarrow W - \gamma{df \over dW}$ in my program.

What is a systematic way of deriving this quantity? I know the chain rule will be involved, and I understand how to take gradients, but I have never taken a derivative of a scalar function with respect to a matrix. This example is illustrative of a larger model I'm working on.

Thanks!


Best answer:

Given a function of a scalar argument ($x$) we can write its differential in terms of its derivative as $$df = f'\,dx$$ When the function is applied element-wise to a matrix argument ($X$), we can use the Hadamard product to write the differential as $$df = f'\circ dX$$
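This element-wise rule can be checked numerically: for a small perturbation $dX$, the change in $f(X)$ is approximated to first order by the Hadamard product $f'(X)\circ dX$. A minimal sketch using $f = \sin$ (so $f' = \cos$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
dX = 1e-6 * rng.normal(size=(3, 4))   # small perturbation

# Element-wise f = sin, so f' = cos; the differential is f'(X) ∘ dX.
df_exact = np.sin(X + dX) - np.sin(X)
df_linear = np.cos(X) * dX            # Hadamard product f'(X) ∘ dX
# The two agree to first order in dX.
```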

The next tricky bit is taking the derivative of the max function, but we can make use of the Heaviside step function, applied element-wise ($H(s)_i = 1$ if $s_i > 0$, else $0$), to write the differential of $a$ as $$da = H(s)\circ ds$$ Below, $H(s)$ is abbreviated to $H$. We will substitute this differential into the differential of the loss function.
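The same first-order check works here: away from $s_i = 0$, a small perturbation of $\max(0, s)$ is exactly $H(s)\circ ds$. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.normal(size=5)
ds = 1e-7 * rng.normal(size=5)       # small perturbation

H = (s > 0).astype(float)            # Heaviside step H(s), element-wise
da_linear = H * ds                   # da = H(s) ∘ ds

da_exact = np.maximum(0.0, s + ds) - np.maximum(0.0, s)
# da_exact matches da_linear (no s_i sits close enough to 0 to flip sign here).
```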

Writing the loss function in terms of the Frobenius product (:) and taking its differential yields $$\eqalign{ f &= \frac{1}{2}(t-y):(t-y) \cr &= \frac{1}{2}z:z \cr\cr df &= z:dz \cr &= (t-y):dt \cr &= (t-y):M\,da \cr &= M^T(t-y):da \cr &= M^T(t-y):H\circ ds \cr &= (M^T(t-y))\circ H:ds \cr &= \big(M^T(t-y)\big)\circ H:dW\,x \cr &= \Big(\big(M^T(t-y)\big)\circ H\Big)x^T:dW \cr }$$ Since $df=\frac{\partial f}{\partial W}:dW$, the gradient must be $$ \frac{\partial f}{\partial W} = \Big(\big(M^T(t-y)\big)\circ H\Big)x^T $$

In the above, I've made use of the mixed product rule for Hadamard-Frobenius products $$A:B\circ C = A\circ B:C$$

If you're uncomfortable with the Frobenius product, you can replace it with the trace function, to which it is equivalent $${\rm tr}(A^TB)=A:B$$ Also note that, unlike the normal matrix product, both the Frobenius and Hadamard products are commutative.
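As a sanity check, the final formula can be translated into NumPy and compared against a finite-difference approximation of one gradient entry. The sizes and variable names below are hypothetical; the broadcasting expression implements the outer product with $x^T$:

```python
import numpy as np

n, m, p = 4, 3, 2                     # hypothetical sizes
rng = np.random.default_rng(0)
x, b = rng.normal(size=n), rng.normal(size=m)
W, M = rng.normal(size=(m, n)), rng.normal(size=(p, m))
c, y = rng.normal(size=p), rng.normal(size=p)

def loss(W):
    s = W @ x + b
    a = np.maximum(0.0, s)
    t = M @ a + c
    return 0.5 * np.sum((t - y) ** 2)

# Gradient from the derivation: df/dW = ((M^T (t - y)) ∘ H(s)) x^T
s = W @ x + b
t = M @ np.maximum(0.0, s) + c
H = (s > 0).astype(float)
grad = ((M.T @ (t - y)) * H)[:, None] * x[None, :]   # shape (m, n)

# Central finite difference on the (0, 0) entry of W.
eps = 1e-6
E = np.zeros_like(W); E[0, 0] = eps
fd = (loss(W + E) - loss(W - E)) / (2 * eps)
```

With `grad` in hand, the gradient step from the question is just `W -= gamma * grad` for some learning rate `gamma`.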