Trying to understand why a certain matrix derivative is sparse


I'm having a hard time understanding matrix derivatives with respect to matrices, and came upon the following exercise which I am not sure how to solve.

Let there be matrices ${\bf X} \in \Bbb R^{64 \times 1024}$ and ${\bf W} \in \Bbb R^{512 \times 1024}$. Let ${\bf Y} := {\bf X} {\bf W}^\top$. I am interested in understanding the derivative $\frac{\partial {\bf Y}}{\partial {\bf X}}$.

  1. Am I correct in saying that its shape is $64 \times 1024 \times 1024 \times 512$?

  2. It is stated in a textbook with a similar exercise that it is sparse, but I can't figure out why or which elements.

2 Answers

Best answer:

Welcome to MSE :D

A simple way to deal with derivatives involving matrix multiplication is to write everything in summation form: $$ Y=XW^\top\\ Y_{ij}=\sum_k^{1024}X_{ik}W_{jk} $$ What you mean by $\partial Y/\partial X$ is the 4-d "tensor" $$ \frac{\partial Y_{ij}}{\partial X_{kl}} $$ The exact shape depends on your convention for laying out matrix derivatives. If the indices are ordered $ijkl$, then the shape is $(64, 512, 64, 1024)$, so I think the shape you proposed is wrong.
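As a quick sanity check of the summation form (a small NumPy sketch; the toy sizes 4, 6, 5 stand in for 64, 1024, 512):

```python
import numpy as np

# Check that Y[i, j] = sum_k X[i, k] * W[j, k] is exactly X @ W.T.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # stands in for the 64 x 1024 matrix
W = rng.standard_normal((5, 6))   # stands in for the 512 x 1024 matrix

Y = X @ W.T
Y_sum = np.einsum('ik,jk->ij', X, W)   # the summation form, index by index

assert np.allclose(Y, Y_sum)
```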

To evaluate this tensor, just differentiate the summation formula: $$ \frac{\partial Y_{ij}}{\partial X_{kl}}=\frac{\partial\sum_m^{1024}X_{im}W_{jm}}{\partial X_{kl}}\\ =\sum_m^{1024}W_{jm}\frac{\partial X_{im}}{\partial X_{kl}}\\ =\sum_m^{1024}W_{jm}\delta_{ik}\delta_{ml}\\ =W_{jl}\delta_{ik} $$ Here $\delta_{ab}$ is the Kronecker delta: $\delta_{ab}=1$ if $a=b$, and $\delta_{ab}=0$ otherwise.
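The closed form $W_{jl}\delta_{ik}$ can be verified numerically (a sketch with NumPy and toy dimensions; for a linear map, finite differences recover the Jacobian essentially exactly):

```python
import numpy as np

# Build the full 4-D Jacobian of Y = X @ W.T by finite differences and
# compare it to the closed form J[i, j, k, l] = W[j, l] * delta(i, k).
rng = np.random.default_rng(0)
n, d, m = 4, 6, 5                      # X is n x d, W is m x d, Y is n x m
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))

eps = 1e-6
J = np.zeros((n, m, n, d))             # index order (i, j, k, l)
for k in range(n):
    for l in range(d):
        Xp = X.copy()
        Xp[k, l] += eps                # perturb one entry of X
        J[:, :, k, l] = (Xp @ W.T - X @ W.T) / eps

# Closed form via the Kronecker delta, written as an identity matrix.
J_closed = np.einsum('jl,ik->ijkl', W, np.eye(n))

assert np.allclose(J, J_closed, atol=1e-4)
```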

Since $\delta_{ik}$ vanishes whenever $i \neq k$, only the entries with $i = k$ can be nonzero — a fraction $1/64$ of the tensor — so the target tensor is sparse.
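To make the sparsity concrete (a sketch with small sizes; with the actual $64$-row $X$, the nonzero fraction would be $1/64$):

```python
import numpy as np

# For generic W, J[i, j, k, l] = W[j, l] * delta(i, k) is nonzero only when
# i == k, i.e. on a fraction 1/n of the tensor's entries.
rng = np.random.default_rng(1)
n, d, m = 4, 6, 5
W = rng.standard_normal((m, d))
J = np.einsum('jl,ik->ijkl', W, np.eye(n))

nonzero_fraction = np.count_nonzero(J) / J.size
print(nonzero_fraction)   # 1/n = 0.25 here; 1/64 for the original sizes
```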

Second answer:

There are multiple conventions: Let's say $X \in \mathbb R^{a \times b}$ and $W \in \mathbb R^{c \times d}$ for clarity.

When you write $\frac{\partial Y}{\partial X}$, we usually mean the object containing the derivatives of all $Y_{ij}$ with respect to all $X_{kl}$; in that case, for every tuple $(i,j,k,l)$ we get a scalar

$$\frac{\partial Y_{ij}}{\partial X_{kl}}$$

Even for vector-by-vector derivatives there are already two conventions (see https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions) on how to lay them out, and I think for matrix-by-matrix derivatives there really aren't any widespread conventions, so usually we just work with indices. In that sense it certainly makes sense to say the derivative has shape $(a,b,c,d)$, or any permutation thereof, as long as it is consistent with how you use your indices.
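One way to see that only consistency matters: contracting the 4-D tensor against a hypothetical upstream gradient $\partial L/\partial Y$ reproduces the familiar matrix identity $\partial L/\partial X = (\partial L/\partial Y)\,W$, as long as the indices line up. A NumPy sketch, using the $(i,j,k,l)$ layout and toy sizes:

```python
import numpy as np

# Chain rule as an index contraction: dL/dX[k, l] = sum_ij dL/dY[i, j] * J[i, j, k, l].
rng = np.random.default_rng(2)
n, d, m = 4, 6, 5
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
dL_dY = rng.standard_normal((n, m))          # a hypothetical upstream gradient

J = np.einsum('jl,ik->ijkl', W, np.eye(n))   # dY[i, j] / dX[k, l]
dL_dX = np.einsum('ij,ijkl->kl', dL_dY, J)   # contract matching indices

# Matches the matrix form of the same chain rule.
assert np.allclose(dL_dX, dL_dY @ W)
```

Any other layout of the four indices works the same way; only the einsum subscripts change.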

Regarding the sparseness: Consider an entry of the first row of $X$, say $X_{1l}$ ($k=1$). After the matrix multiplication, this value only appears in the first row of $Y$, that is, in the entries $Y_{1j}$ ($i=1$). So we can already say that

$$\frac{\partial Y_{ij}}{\partial X_{kl}} = 0 \quad \forall k \neq i$$

This is the sense in which this derivative is sparse.
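The argument can be checked directly: perturbing a single entry $X_{kl}$ changes only row $k$ of $Y$ (a small NumPy sketch with toy sizes):

```python
import numpy as np

# Perturb one entry X[k, l]; only row k of Y = X @ W.T changes,
# so dY[i, j]/dX[k, l] = 0 whenever i != k.
rng = np.random.default_rng(3)
n, d, m = 4, 6, 5
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))

k, l = 2, 3                        # an arbitrary entry to perturb
Xp = X.copy()
Xp[k, l] += 1.0
diff = Xp @ W.T - X @ W.T          # change in Y

changed_rows = np.nonzero(np.abs(diff).sum(axis=1) > 1e-12)[0]
print(changed_rows)                # only row k appears
```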