Rigorous calculation of gradient using matrix-calculus

96 Views Asked by At

Let $F_N=(W^N)^TW^{N-1}...W^1X$, where $W^j\in\mathbb{R}^{n\times n}$ for $1\leq j \leq N-1$, $W^N\in\mathbb{R}^n$, and $X\in \mathbb{R}^n$; therefore $F_N\in \mathbb{R}$. I am interested in calculating $\dfrac{\partial F_N}{\partial W^m}$ for $1\leq m\leq N$ using matrix-calculus notation.

Let's consider the case for $1\leq m\leq N-1$. Since $F_N$ is a scalar and $W^m$ is a $n\times n$ matrix, then the derivative must be an $n\times n$ matrix.

The matrix calculus calculator yields the result:

$$\dfrac{\partial F_N}{\partial W^m}=(W^{N-1})^T(W^{N-2})^T...(W^{m+1})^TW^N(W^{m-1}W^{m-2}...W^1X)^T$$.

This result is not obvious to me. I don't know where it comes from. Although I verified it and it works. If I were to do it manually, I'd do the following:

\begin{equation}\dfrac{\partial F_N}{\partial W^m}=\dfrac{\partial F_N}{\partial F_{N-1}}\dfrac{\partial F_{N-1}}{\partial F_{N-2}}...\dfrac{\partial F_m}{\partial W^m}\end{equation}

According to the wikipedia article on matrix calculus we get that $\dfrac{\partial F_N}{\partial F_{N-1}}$ is a vector because $F_N$ is a scalar and $F_{N-1}$ is a vector. The subsequent terms until $\dfrac{\partial F_m}{\partial W^m}$ are all $n\times n$ matrices because $F_j$ and $F_{j-1}$ are both vectors. Finally, $\dfrac{\partial F_m}{\partial W^m}$ should be a tensor of 3rd degree because $F_m$ is a vector and $W^m$ is a matrix.

Then, I end up with $\mathbb{R}^n\cdot\mathbb{R}^{n \times n}\cdot \mathbb{R}^{n\times n \times n}$, which I guess produces an $n \times n$ matrix. However, all of this feels very wrong and it feels like I am trying to make things work, instead of having a systematic way of computing this expression like the matrix calculus calculator.

I'd appreciate if someone can explain where the formula given by the calculator came from/how to derive it; or how to go from the formulation equation (second equation) to the result equation (first equation). And in general how to derive these formulas when it comes to matrix calculus and derivatives. For example, what happens when you have an outer product in the middle? or a dot product? I'd imagine there are many resources out there to answer this. I'd appreciate if someone can share them. So far I have found resources that are very introductory and don't really deal with derivatives using the chain rule.

1

There are 1 best solutions below

0
On BEST ANSWER

$ \def\p{\partial} \def\LR#1{\left(#1\right)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\M{\c{M}} \def\DM{\c{dM}} $When learning Matrix Calculus it is very helpful to use a consistent notation for the various quantities that are involved. Towards that end I'd recommend using uppercase letters for matrices, lowercase letters for vectors, and Greek letters for scalars.

It is also helpful to reserve superscripts to indicate operations like complex conjugates and transposes, $({\rm e.g.}\;A^T)$ and to reserve subscripts to indicate individual components, $({\rm e.g.}\;A_{ij})$.

Eliminating superfluous superscripts and subscripts has the added advantage of reducing the visual "clutter" so that you can really focus on the problem at hand.

One should also try to reduce a problem to the simplest form which preserves its essence.

One final issue which trips up students is correctly handling matrix transpositions. I've found that consistently using the Frobenius product (below) avoids these pitfalls.

The Frobenius product is simply a concise product notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a such a product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \\&= \LR{A^TC}:B \\ \\ }$$


Using the above ideas, restate the problem as calculating the gradient of a scalar function, consisting of a chain of matrices multiplied by two vectors on each end, with respect to a matrix $(M)$ in the middle of the chain. $$\eqalign{ \phi &= w^T(L\M N)x \\ &= \LR{wx^T}:(L\M N) \\ &= \LR{L^Twx^TN^T}:\M \\ }$$ The actual calculation is very straightforward $$\eqalign{ d\phi &= \LR{L^Twx^TN^T}:\DM \\ \grad{\phi}{\M} &= {L^Twx^TN^T} \\ }$$ If you wish, you can rearrange this expression by moving one (or both) of the vectors inside of the transpose operations. In particular, the following rearrangement matches the referenced result $$\eqalign{ \grad{\phi}{M} &= L^Tw\,\LR{Nx}^T \\ }$$