Suppose I have two 2-by-2 matrices $X$ and $W$. I will use $@$ for matrix multiplication and $*$ for elementwise multiplication.
Let $Z=X@W^T$ and $Y=\operatorname{sigmoid}(Z)=\frac{1}{1+\exp(-Z)}$, and take the gradient of $Y$ with respect to itself to be a 2-by-2 matrix of ones.
With help from this website I have $\frac{\partial{Y}}{\partial{X}}=Y*(1-Y)@W$. Now how can I differentiate $\frac{\partial{Y}}{\partial{X}}$ again w.r.t. $X$? It contains both elementwise and matrix multiplications, and I get it wrong every time I try to apply the chain rule. Can someone help me with the proper math? Thanks.
Use the symbol $(\odot)$ to denote the elementwise/Hadamard product, $(\otimes)$ to denote the Kronecker product, and $(\cdot)$ to denote the standard matrix product. Here ${\rm vec}$ stacks the columns of its matrix argument into a single vector.
$$\eqalign{
Z &= X\cdot W^T &\iff z = {\rm vec}(Z) = (W\otimes I)\cdot{\rm vec}(X) = M\cdot x \\
Y &= \sigma(Z) &\iff y = {\rm vec}(Y) = \sigma(z) \\
A &= Y-Y\odot Y &\iff a = {\rm vec}(A)\;\;\;\iff\; D={\rm Diag}(a) \\
dY &= A\odot dZ \\
dy &= a\odot dz = D\cdot dz &\;\;=\;\; D\cdot M\cdot dx \\
\frac{\partial y}{\partial x} &= D\cdot M &\iff\; \boxed{\color{red}{\;\frac{\partial Y}{\partial X} = \vec{\mu}\cdot D\cdot M\cdot\vec{\nu}}\;} \\
}$$
where the components of the third-order tensors $(\vec{\mu},\vec{\nu})$ in the final expression are given by
$$\eqalign{
{\vec\nu}_{\ell jk} &= \begin{cases} 1\quad{\rm if}\;\;\ell=j+2k-2 \\ 0\quad{\rm otherwise} \end{cases} \\
{\vec\mu}_{jk\ell} &= {\vec\nu}_{\ell jk} \\
}$$
and the index ranges are
$$1\le j,k\le 2, \qquad 1\le \ell\le 4$$
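The vectorized result $\frac{\partial y}{\partial x} = D\cdot M$ can be checked numerically. The following is only a sketch in NumPy, not part of the derivation: the helper names `vec`, `jacobian_DM` and `jacobian_fd` are made up for illustration, and `vec` is assumed to be column-major, which is the convention that makes ${\rm vec}(X\cdot W^T) = (W\otimes I)\cdot{\rm vec}(X)$ hold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vec(A):
    # Column-major (Fortran-order) stacking, assumed to match vec() in the text
    return A.reshape(-1, order="F")

def jacobian_DM(X, W):
    # Closed form from the derivation: dy/dx = D . M
    Y = sigmoid(X @ W.T)
    A = Y - Y * Y                       # A = Y - Y (.) Y
    D = np.diag(vec(A))                 # D = Diag(vec(A))
    M = np.kron(W, np.eye(X.shape[0]))  # M = W (x) I
    return D @ M

def jacobian_fd(X, W, eps=1e-6):
    # Central finite differences, one column per entry of vec(X)
    x = vec(X)
    cols = []
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        Xp = (x + e).reshape(X.shape, order="F")
        Xm = (x - e).reshape(X.shape, order="F")
        cols.append(vec(sigmoid(Xp @ W.T) - sigmoid(Xm @ W.T)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
W = rng.standard_normal((2, 2))
print(np.allclose(jacobian_DM(X, W), jacobian_fd(X, W), atol=1e-7))
```

For the 2-by-2 case both functions return a 4-by-4 matrix whose $(\ell, m)$ entry is the derivative of the $\ell$-th entry of ${\rm vec}(Y)$ with respect to the $m$-th entry of ${\rm vec}(X)$.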
An alternative approach (which avoids tensors) is to note that $$\eqalign{ \frac{\partial X}{\partial X_{ij}} &= E_{ij} \\ }$$ where $E_{ij}$ is a matrix whose $(i,j)^{\rm th}$ element equals one and all other elements equal zero.
Use this to calculate a component-wise gradient $$\eqalign{ dY &= A\odot\Big(dX\cdot W^T\Big) \\ \frac{\partial Y}{\partial X_{ij}} &= A\odot\Big(E_{ij}\cdot W^T\Big) \\ }$$
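The component-wise formula is easy to test, and the same trick can be applied once more to reach the second derivative asked about in the question. The sketch below is an assumption-laden illustration rather than part of the answer above: the function names are hypothetical, indices are 0-based, and the second-derivative expression is obtained by differentiating $A = Y - Y\odot Y$, which gives $dA = (1-2Y)\odot dY$ and hence $\frac{\partial^2 Y}{\partial X_{kl}\,\partial X_{ij}} = (1-2Y)\odot A\odot\big(E_{kl}\cdot W^T\big)\odot\big(E_{ij}\cdot W^T\big)$. Both formulas are compared against central finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(shape, i, j):
    # E_ij: one at position (i, j) (0-based here), zeros elsewhere
    E = np.zeros(shape)
    E[i, j] = 1.0
    return E

def dY_dXij(X, W, i, j):
    # First derivative from the text: A (.) (E_ij . W^T)
    Y = sigmoid(X @ W.T)
    A = Y - Y * Y
    return A * (unit(X.shape, i, j) @ W.T)

def d2Y_dXij_dXkl(X, W, i, j, k, l):
    # Assumed extension (not in the original answer): uses dA = (1 - 2Y) (.) dY
    Y = sigmoid(X @ W.T)
    A = Y - Y * Y
    Eij_Wt = unit(X.shape, i, j) @ W.T
    Ekl_Wt = unit(X.shape, k, l) @ W.T
    return (1.0 - 2.0 * Y) * A * Ekl_Wt * Eij_Wt

def central_diff(f, X, k, l, eps=1e-5):
    # Numerical derivative of the matrix-valued f with respect to X[k, l]
    E = unit(X.shape, k, l)
    return (f(X + eps * E) - f(X - eps * E)) / (2 * eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
W = rng.standard_normal((2, 2))

first_ok = np.allclose(
    dY_dXij(X, W, 0, 1),
    central_diff(lambda X_: sigmoid(X_ @ W.T), X, 0, 1),
    atol=1e-7,
)
second_ok = np.allclose(
    d2Y_dXij_dXkl(X, W, 0, 1, 0, 0),
    central_diff(lambda X_: dY_dXij(X_, W, 0, 1), X, 0, 0),
    atol=1e-6,
)
print(first_ok, second_ok)
```

Since $E_{ij}\cdot W^T$ is nonzero only in row $i$, the second derivative vanishes whenever $i \ne k$: each row of $Y$ depends only on the matching row of $X$.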