Gradient Equations as a system of polynomials


I was trying to understand Eq. (3) of [1] and was a bit overwhelmed by the matrix shorthand used to keep the equation concise.

Eq (3)

Let $W_i \in \mathbb{R}^{d_{i-1} \times d_i}$, $W = W_{H+1} \dots W_1$, $U^\top_i = \prod^{H+1}_{j = i + 1} W^\top_j$, and $V^\top_i = \prod^{i-1}_{j = 1} W^\top_j$. The objective function is $L(W) = \bar{L}(W) + \lambda \sum^{H+1}_{i=1} || W_i ||^2_2$ with $\bar{L}(W) = \frac{1}{2} \sum^m_{i=1} || (W X)_{\cdot, i} - Y_{\cdot, i} ||^2_2 $.

Then, it states that the partial derivatives can be written as the following system of polynomial equations:

$\frac{\partial L}{\partial W_i} = U^\top_i \big( W \big( \sum^m_{k=1} x_k x^\top_k \big) - \big( \sum^m_{k=1} y_k x^\top_k \big) \big) V^\top_i + \frac{\lambda}{2} W_i$
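The structure of the first term can be checked numerically. Below is a minimal sketch (my own, not from the paper) that verifies the unregularized part, $\partial \bar{L}/\partial W_i = U_i^\top \big( WXX^\top - YX^\top \big) V_i^\top$, against central finite differences; the depth, layer widths, and sample count are arbitrary illustration choices.

```python
# Finite-difference check of dLbar/dW_i = U_i^T (W X X^T - Y X^T) V_i^T.
# All sizes below are arbitrary choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)
dims = [3, 4, 5, 2]                 # d_0, d_1, d_2, d_3  (H = 2 hidden layers)
m = 6
# Ws[k] plays the role of W_{k+1}, mapping R^{dims[k]} -> R^{dims[k+1]}
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
X = rng.standard_normal((dims[0], m))
Y = rng.standard_normal((dims[-1], m))

def chain(ws, d_in):
    """Product ws[-1] @ ... @ ws[0], starting from the d_in x d_in identity."""
    P = np.eye(d_in)
    for Wk in ws:
        P = Wk @ P
    return P

def Lbar(ws):
    R = chain(ws, dims[0]) @ X - Y
    return 0.5 * np.sum(R * R)

i = 1                               # test the layer W_2 (0-based list index 1)
M = chain(Ws, dims[0]) @ X @ X.T - Y @ X.T
U = chain(Ws[i + 1:], dims[i + 1])  # U_i = W_{H+1} ... W_{i+1}
V = chain(Ws[:i], dims[0])          # V_i = W_{i-1} ... W_1
analytic = U.T @ M @ V.T

eps = 1e-6                          # central differences, entry by entry
numeric = np.zeros_like(Ws[i])
for a in range(Ws[i].shape[0]):
    for b in range(Ws[i].shape[1]):
        Wp = [w.copy() for w in Ws]; Wp[i][a, b] += eps
        Wm = [w.copy() for w in Ws]; Wm[i][a, b] -= eps
        numeric[a, b] = (Lbar(Wp) - Lbar(Wm)) / (2 * eps)

err = np.max(np.abs(analytic - numeric))
print(err)                          # tiny: the two gradients agree
```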

But the paper doesn't include the derivation... can someone prove this?

reference

[1] Mehta, Dhagash, et al. "The loss surface of deep linear networks viewed through the algebraic geometry lens." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.9 (2021): 5664-5680.

BEST ANSWER

Define a partial product function
$$\eqalign{
\def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\o{{\tt1}} \def\l{\bar L} \def\R{\right} \def\L{\left} \def\qiq{\quad\implies\quad} \def\trace{\operatorname{Tr}}
P(i,j) &= \begin{cases} {\displaystyle\prod_{k=i}^{j}}\;W_k \quad {\rm if}\;i\ge j \\ \\ \;I \qquad\quad {\rm otherwise} \\ \end{cases}
}$$
and use it to expand the differential of $W$
$$\eqalign{
W = P(H,\o) \qiq dW = \sum_{k=H}^{\o}P(H,k+\o)\cdot dW_k\cdot P(k-\o,\o) \\
}$$
$\sf NB$:$\,$ In the above, I renamed $(H+\o)\to H$, because the original notation just adds visual clutter to the problem.
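This product-rule expansion can be sanity-checked numerically: perturbing a single factor $W_k$ changes the product $W$ by $P(H,k+1)\cdot dW_k\cdot P(k-1,1)$ to first order. The snippet below is my own illustration with numpy and arbitrary sizes.

```python
# First-order check of dW = sum_k P(H, k+1) . dW_k . P(k-1, 1):
# perturb one factor and compare against the predicted first-order change.
import numpy as np

rng = np.random.default_rng(1)
dims = [3, 4, 5, 4, 2]              # H = 4 factors in the renamed convention W = W_H ... W_1
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(len(dims) - 1)]

def chain(ws, d_in):                # partial product of the listed factors
    P = np.eye(d_in)
    for Wk in ws:
        P = Wk @ P
    return P

k = 1                               # perturb the second factor (0-based index)
E = rng.standard_normal(Ws[k].shape)
eps = 1e-7
Wp = [w.copy() for w in Ws]
Wp[k] += eps * E

delta = chain(Wp, dims[0]) - chain(Ws, dims[0])   # actual change in W
first_order = chain(Ws[k + 1:], dims[k + 1]) @ (eps * E) @ chain(Ws[:k], dims[0])
rel_err = np.max(np.abs(delta - first_order)) / eps
print(rel_err)                      # O(eps): the expansion is exact to first order
```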

The Frobenius product will also be very useful
$$\eqalign{ A:B &= \sum_j\sum_k A_{jk}B_{jk} \;=\; \trace(A^TB) \\ A:A &= \L\|A\R\|^2_F \\ A:B &= B:A \;=\; B^T:A^T \\ (AB):C &= A:(CB^T) \;=\; B:(A^TC) \\ }$$
For typing convenience, define the matrix $$M = WXX^T-YX^T$$
The $\l$ part of the cost function can be differentiated as
$$\eqalign{ \l &= \tfrac12 \L(WX-Y\R):\L(WX-Y\R) \\ d\l &= \L(WX-Y\R):\L(dW\,X\R) \\ &= \L(WXX^T-YX^T\R):dW \\ &= M:\L(\sum_{k=H}^{\o}P(H,k+\o)\cdot dW_k\cdot P(k-\o,\o)\R) \\ &= \sum_{k=H}^{\o}P(H,k+\o)^T\cdot M\cdot P(k-\o,\o)^T:\,dW_k \\ }$$
Since we're only interested in the $dW_k$ component, drop the summation to obtain
$$\eqalign{ d\l &= P(H,k+\o)^T\cdot M\cdot P(k-\o,\o)^T:\,dW_k \\ \grad{\l}{W_k} &= P(H,k+\o)^T\cdot M\cdot P(k-\o,\o)^T \\ }$$
Note that $P(H,k+\o)^T = U_k^T$ and $P(k-\o,\o)^T = V_k^T$, so this is the first term of Eq (3). The quadratic penalty differentiates directly, $d\L(\lambda\|W_k\|^2\R) = 2\lambda\,W_k:dW_k$, which supplies the remaining $W_k$ term; the constant in front of $\lambda$ depends on how the paper normalizes the penalty.
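The Frobenius-product rules used above are easy to spot-check numerically. The snippet below (my own illustration; sizes are arbitrary) verifies each identity with random matrices.

```python
# Spot-check of the Frobenius product identities:
#   A:B = Tr(A^T B),  A:A = ||A||_F^2,  A:B = B:A,  (AB):C = A:(CB^T) = B:(A^T C)
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))

def frob(P, Q):                     # P:Q = Tr(P^T Q) = elementwise sum of products
    return np.trace(P.T @ Q)

lhs = frob(A @ B, C)
assert np.isclose(lhs, frob(A, C @ B.T))                        # (AB):C = A:(CB^T)
assert np.isclose(lhs, frob(B, A.T @ C))                        # (AB):C = B:(A^T C)
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro') ** 2)    # A:A = ||A||_F^2
assert np.isclose(frob(A, C), frob(C, A))                       # symmetry
print("all Frobenius identities hold")
```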


As you delve further into the field of Machine Learning / Neural Networks, you will discover that most of the books and journal articles use poor mathematical notation (including those by respected researchers). Worse, it is inconsistent from author to author. Good luck!

A simple example from the current article $$YX^T = \sum_k y_kx_k^T$$
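For concreteness, this identity can be confirmed in a couple of lines of numpy (arbitrary sizes, my own illustration): the matrix product $YX^T$ equals the sum of the per-sample outer products $y_kx_k^T$.

```python
# Check that Y X^T == sum_k y_k x_k^T, where x_k, y_k are the columns of X, Y.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 6))     # columns are the x_k
Y = rng.standard_normal((2, 6))     # columns are the y_k
outer_sum = sum(np.outer(Y[:, k], X[:, k]) for k in range(X.shape[1]))
assert np.allclose(Y @ X.T, outer_sum)
print("Y X^T == sum_k y_k x_k^T")
```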