I am trying to understand how to differentiate a matrix equation of the form:
$$a = \tanh(WX + b)$$ in which $W$ is an $M \times N$ matrix, $X$ is $N \times 1$, and $b$ is $M \times 1$. I'm trying to take the derivative of $a$ with respect to $W$, $X$, and $b$. I already have the final answers:
$$\partial a/ \partial X= W^T(1 - \tanh(WX+b)^2)$$ I don't understand how $W$ moves to the left-hand side of $(1 - \tanh(WX+b)^2)$ and gets transposed. I understand that the chain rule is:
$$\partial f(u)/ \partial x= f'(u)\partial u/ \partial x$$ so in my example $\partial u/ \partial x$ is on the right hand side of the equation.
The second result is $$\partial a/ \partial W=(1 - \tanh(WX+b)^2)X^T$$ and here I don't understand how $X$ gets transposed and ends up on the right-hand side.
Define the variables

$$\eqalign{ y &= Wx+b \cr a &= \tanh(y) \implies A = {\rm Diag}(a) \cr }$$

Now calculate the differential and gradient of $a$ wrt $x$

$$\eqalign{ da &= (1-a\odot a)\odot dy \cr &= (I-A^2)\,dy \cr &= (I-A^2)W\,dx \cr \frac{\partial a}{\partial x} &= (I-A^2)W \cr }$$

Depending on your layout convention, you might prefer the transpose of this result

$$\eqalign{ \frac{\partial a}{\partial x} &= W^T(I-A^2) \cr }$$

Note that the gradient of a vector wrt a vector produces a matrix result.

Your second question is about the gradient of a vector wrt a matrix, which will produce a third-order tensor.

$$\eqalign{ da &= (1-a\odot a)\odot dy \cr &= (I-A^2)\,dW\,x \cr &= (I-A^2){\mathcal H}x:dW \cr \frac{\partial a}{\partial W} &= (I-A^2){\mathcal H}x \cr }$$

The above steps use several different product notations which you may not be familiar with

$$\eqalign{ \lambda &=A:B &\implies \lambda = \sum_i\sum_j A_{ij} B_{ij} \cr L &=A\odot B &\implies L_{ij} = A_{ij} B_{ij} \cr C &= AB &\implies C_{ij} = \sum_k A_{ik} B_{kj} \cr }$$

The symbol ${\mathcal H}$ is a fourth-order tensor with components

$${\mathcal H}_{ijkl} = \delta_{ik} \delta_{jl}$$
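
If it helps, the results above can be checked numerically. The following is only a minimal NumPy sketch, not part of the derivation: the helper names (`forward`, `jacobian_x`, `jacobian_W`, `numeric_jacobian`), the sizes $M=3$, $N=4$, and the upstream vector `g` are all assumptions made for illustration. It compares $(I-A^2)W$ with a finite-difference Jacobian, and it builds the third-order tensor from its component form $\partial a_i/\partial W_{kl} = (1-a_i^2)\,\delta_{ik}\,x_l$, which is what the ${\mathcal H}$ expression reduces to when written out in indices.

```python
import numpy as np

def forward(W, x, b):
    """a = tanh(Wx + b) with W of shape (M, N), x of shape (N,), b of shape (M,)."""
    return np.tanh(W @ x + b)

def jacobian_x(W, x, b):
    """Analytic da/dx = (I - A^2) W, shape (M, N)."""
    a = forward(W, x, b)
    return (1.0 - a**2)[:, None] * W  # scale row i of W by (1 - a_i^2)

def jacobian_W(W, x, b):
    """Analytic da_i/dW_kl = (1 - a_i^2) * delta_ik * x_l, shape (M, M, N)."""
    a = forward(W, x, b)
    M = W.shape[0]
    return np.einsum("i,ik,l->ikl", 1.0 - a**2, np.eye(M), x)

def numeric_jacobian(f, v, eps=1e-6):
    """Central finite differences; result has shape f(v).shape + v.shape."""
    out = np.zeros(f(v).shape + v.shape)
    for idx in np.ndindex(v.shape):
        step = np.zeros_like(v)
        step[idx] = eps
        out[(...,) + idx] = (f(v + step) - f(v - step)) / (2 * eps)
    return out

# Hypothetical sizes and random data, chosen only for the check.
rng = np.random.default_rng(0)
M, N = 3, 4
W = rng.normal(size=(M, N))
x = rng.normal(size=N)
b = rng.normal(size=M)

Jx = jacobian_x(W, x, b)
JW = jacobian_W(W, x, b)
Jx_num = numeric_jacobian(lambda v: forward(W, v, b), x)
JW_num = numeric_jacobian(lambda V: forward(V, x, b), W)
print(np.allclose(Jx, Jx_num, atol=1e-6), np.allclose(JW, JW_num, atol=1e-6))

# Contracting with an upstream vector g (an assumption, e.g. dLoss/da in backprop)
# gives expressions with the same shape as the formulas quoted in the question.
g = rng.normal(size=M)
a = forward(W, x, b)
delta = (1.0 - a**2) * g
grad_x = W.T @ delta         # same as Jx.T @ g, shape (N,)
grad_W = np.outer(delta, x)  # same as sum_i g_i * JW[i], shape (M, N)
```

The last few lines suggest where the $W^T(\ldots)$ and $(\ldots)X^T$ forms in the question may come from: once the Jacobian is contracted with a vector of the same shape as $a$, the third-order tensor collapses to an outer product with $x^T$, and the Jacobian wrt $x$ appears transposed.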