I have seen lots of people ask this question: what is $dF/dW$ when $F = WX$? Here $W$ is an $m \times n$ matrix and $X$ is an $n \times p$ matrix.
The simple answer usually given is $X^{T}$. Where does this come from?
I googled this question, and Stanford's CS231N gives an explanation. If you derive it, the result is supposed to be a higher-order tensor (4 free indices), something like a matrix whose elements are themselves matrices.
In case you are wondering whether I searched this site before asking, and are thinking of closing this question, here are some of my findings from this site and from other resources I came across.
This question attempted to demystify the answer. The answer given there is elaborate, but it mentions that the result can be realized using the Kronecker product. Isn't that a roundabout way to get there? What if we want to derive it from the basic rules (that is, multiply the two matrices and then differentiate each of the $mp$ terms with respect to every element of $W$)?
The resources mentioned in CS231N. Yes, I checked those. I understand the material on matrix derivatives, but I can't find the connection between the two.
What am I missing? How does one derive this kind of expression from the basics?
I want to make sure that I understand this. Thanks.
- The CS231N resource I mentioned: Vector, Matrix, and Tensor Derivatives, Erik Learned-Miller
- Another resource from the same CS231N course: Derivatives, Backpropagation, and Vectorization, Justin Johnson
In index notation, the function can be written as $$F_{ik} = W_{ij} X_{jk}$$ The indices $\{i,k\}$ are not repeated and are called "free" indices,
but $\{j\}$ is a repeated "dummy" index and is implicitly summed over.
Now calculate the derivative with respect to the component $W_{qr}$ $$\eqalign{ \frac{\partial F_{ik}}{\partial W_{qr}} &= \frac{\partial W_{ij}}{\partial W_{qr}}\;X_{jk} \\ &= \delta_{iq}\delta_{rj}\;X_{jk} \\ &= \delta_{iq}\;X_{rk} \\ }$$ The symbol $\delta_{iq}$ is called a Kronecker delta. It equals $1$ when $i=q$ and $0$ otherwise. In the last step, the sum over $j$ keeps only the term with $j=r$, because $\delta_{rj}$ vanishes for every other $j$.
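The component formula can be checked numerically by doing exactly what "the basics" suggest: perturb one entry of $W$ at a time and watch how every entry of $F$ changes. The following is a minimal NumPy sketch, not a definitive implementation; the sizes `m, n, p`, the random test matrices, and the names `T` and `T_direct` are assumptions introduced only for illustration.

```python
import numpy as np

# Hypothetical small sizes and random data, chosen only for illustration.
m, n, p = 2, 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))

# Analytic result from the index derivation: T[i, k, q, r] = delta_{iq} * X_{rk}
T = np.einsum('iq,rk->ikqr', np.eye(m), X)

# Direct computation: perturb a single entry W[q, r] and difference F = W X.
# F is linear in W, so the finite difference is exact up to rounding error.
eps = 1e-3
T_direct = np.zeros((m, p, m, n))
for q in range(m):
    for r in range(n):
        dW = np.zeros((m, n))
        dW[q, r] = eps
        T_direct[:, :, q, r] = ((W + dW) @ X - W @ X) / eps

print(T.shape)                   # (2, 4, 2, 3)
print(np.allclose(T, T_direct))  # True
```

The `einsum` subscript string mirrors the index expression directly: `iq` is the Kronecker delta, `rk` is $X_{rk}$, and `ikqr` lists the four free indices of the result.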
Since the derivative has 4 free indices $\{i,k,q,r\}$, it is a 4th order tensor, whose dimensions are $(m\times p\times m\times n)$.
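This is also where the commonly quoted $X^T$ comes from. As a sketch, assume (this setup is not stated in the question) that $F$ feeds into some scalar function $\phi$, as it does in backpropagation, and write $G_{ik} = \partial\phi/\partial F_{ik}$ for the upstream gradient. The chain rule contracts $G$ with the tensor over the indices $\{i,k\}$: $$\eqalign{ \frac{\partial \phi}{\partial W_{qr}} &= G_{ik}\;\frac{\partial F_{ik}}{\partial W_{qr}} \\ &= G_{ik}\;\delta_{iq}\;X_{rk} \\ &= G_{qk}\;X_{rk} \\ &= \left(G X^T\right)_{qr} \\ }$$ so $\partial\phi/\partial W = G X^T$, an $m\times n$ matrix with the same shape as $W$. Under this assumption, "the answer is $X^T$" reads as shorthand for "multiply the upstream gradient by $X^T$ on the right", and not as a statement about the full derivative $\partial F/\partial W$.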
Since higher order tensors are awkward to work with, most texts flatten the matrices $(F,W)$ into the vectors $(f,w)$ by stacking their columns, and then calculate the derivative using ordinary matrix notation. $$\eqalign{ {\rm vec}(F) &= {\rm vec}(IWX) = (X^T\otimes I)\,{\rm vec}(W) \\ f &= (X^T\otimes I)\,w \\ df &= (X^T\otimes I)\,dw \\ \frac{\partial f}{\partial w} &= (X^T\otimes I) \\ }$$ This result is an $mp\times mn$ matrix, not a tensor; here $I$ is the $m\times m$ identity matrix and the symbol $\otimes$ represents the Kronecker product.
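The vectorized identity can be checked the same way. The sketch below assumes that ${\rm vec}$ stacks columns (column-major order, `order='F'` in NumPy); the helper `vec`, the sizes, and the random matrices are hypothetical names and values introduced for illustration. It also shows that the Kronecker matrix holds the same numbers as the 4th order tensor $\delta_{iq}X_{rk}$, just laid out in two dimensions.

```python
import numpy as np

# Hypothetical small sizes and random data, chosen only for illustration.
m, n, p = 2, 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))

def vec(A):
    """Stack the columns of A into a single vector (column-major order)."""
    return A.flatten(order='F')

# Jacobian of f = vec(F) with respect to w = vec(W)
J = np.kron(X.T, np.eye(m))
print(J.shape)  # (8, 6), i.e. (m*p, m*n)

# Check f = (X^T kron I) w
print(np.allclose(J @ vec(W), vec(W @ X)))

# The same entries as the 4th order tensor T[i, k, q, r] = delta_{iq} X_{rk}:
# row index i + m*k pairs (i, k), column index q + m*r pairs (q, r).
T = np.einsum('iq,rk->ikqr', np.eye(m), X)
print(np.allclose(J, T.reshape(m * p, m * n, order='F')))
```

So the Kronecker product is less of a detour than it may look: it is the element-by-element derivative, with the index pairs $(i,k)$ and $(q,r)$ each merged into a single index.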