I'm having a hard time understanding derivatives of matrices with respect to matrices, and came upon the following exercise, which I am not sure how to solve.
Let there be matrices ${\bf X} \in \Bbb R^{64 \times 1024}$ and ${\bf W} \in \Bbb R^{512 \times 1024}$. Let ${\bf Y} := {\bf X} {\bf W}^\top$. I am interested in understanding the derivative $\frac{\partial {\bf Y}}{\partial {\bf X}}$.
Am I correct in saying that its shape is $64 \times 1024 \times 1024 \times 512$?
It is stated in a textbook with a similar exercise that it is sparse, but I can't figure out why or which elements.
Welcome to MSE :D
A simple way to deal with derivatives involving matrix multiplication is to write the product in summation form: $$ Y=XW^T, \qquad Y_{ij}=\sum_{k=1}^{1024}X_{ik}W_{jk} $$ So what you mean by $\partial Y/\partial X$ is the 4d "tensor" $$ \frac{\partial Y_{ij}}{\partial X_{kl}} $$ The exact shape depends on your convention for laying out these matrix derivatives. If the indices are ordered $ijkl$, then the shape is $(64,512,64,1024)$. I think your shape is wrong: whatever the ordering, it has to combine the dimensions of $Y$ ($64 \times 512$) with those of $X$ ($64 \times 1024$), so $1024$ can appear only once.

To evaluate this tensor, just differentiate the summation formula: $$ \begin{aligned} \frac{\partial Y_{ij}}{\partial X_{kl}} &= \frac{\partial}{\partial X_{kl}}\sum_{m=1}^{1024}X_{im}W_{jm}\\ &= \sum_{m=1}^{1024}W_{jm}\frac{\partial X_{im}}{\partial X_{kl}}\\ &= \sum_{m=1}^{1024}W_{jm}\delta_{ik}\delta_{ml}\\ &= W_{jl}\delta_{ik} \end{aligned} $$ Here $\delta_{ab}$ is the Kronecker delta: $\delta_{ab}=1$ if $a=b$, and $\delta_{ab}=0$ otherwise.
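Since $\delta_{ik}=0$ whenever $i \neq k$, every entry with $i \neq k$ vanishes: only $64$ of the $64 \times 64$ index pairs $(i,k)$ can give nonzero entries, and for each of those the $(j,l)$ slice is just $W$. With so many zeros, the target tensor is sparse.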
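If it helps, here is a quick numerical sanity check of $\partial Y_{ij}/\partial X_{kl} = W_{jl}\delta_{ik}$ in NumPy. It is only a sketch: the small sizes `n`, `d`, `p` are my own stand-ins for $64$, $1024$, $512$ (the full dense tensor would have $2^{31}$ entries), and the index order $ijkl$ is the convention assumed above.

```python
import numpy as np

# Stand-in sizes (assumption): the question has n=64, d=1024, p=512.
n, d, p = 3, 5, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W = rng.standard_normal((p, d))

# Closed form: J[i, j, k, l] = W[j, l] * delta_ik
J = np.einsum("jl,ik->ijkl", W, np.eye(n))

# Finite-difference check: perturb one entry X[k, l] at a time
# and record how every entry of Y = X @ W.T responds.
eps = 1e-6
Y = X @ W.T
J_fd = np.empty((n, p, n, d))
for k in range(n):
    for l in range(d):
        dX = np.zeros((n, d))
        dX[k, l] = eps
        J_fd[:, :, k, l] = ((X + dX) @ W.T - Y) / eps

max_err = np.abs(J - J_fd).max()

# Only the blocks with i == k are nonzero, i.e. a fraction 1/n of all entries
# (assuming W itself has no zero entries, as with this random W).
nonzero_fraction = np.count_nonzero(J) / J.size

print(J.shape, max_err, nonzero_fraction)
```

With the sizes from the question the same argument gives a nonzero fraction of at most $1/64$, which is why in practice one never builds this tensor explicitly.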