How to write the chain rule when outputs are vectors


Consider the following machine learning problem: We have input matrix $X_{d \times N}$ and output matrix $y_{o \times N}$, where $N$ is the number of samples, $d$ is the input dimension and $o$ is the output dimension. We wish to learn a linear model $\hat y = W X$ to estimate $y$ closely. Note that $\hat y$ is an $o \times N$ matrix. We use the following loss function: $L = \frac{1}{2}\| y - \hat y\|_F^2$.

I want to write the chain rule considering all the matrix forms. In particular, I first compute $\nabla_y L = \hat y - y$. Next, I want to use chain rule to find the gradient of $L$ wrt $W$:

$$\nabla_W L = \nabla_y L \cdot \nabla_W y$$

My understanding is that, since $y$ and $W$ are both matrices, $\nabla_W y$ should not be a matrix but rather a 4th-order tensor whose $(i,j,k,l)$ entry equals $\frac{\partial y_{i,j}}{\partial W_{k,l}}$. I'm confused about how to multiply this 4th-order tensor with a matrix.
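For concreteness, the 4th-order tensor and its contraction against $\nabla_y L$ can be built explicitly with `np.einsum`. This is a minimal NumPy sketch with arbitrary small dimensions; for $y = WX$ one has $\frac{\partial y_{i,j}}{\partial W_{k,l}} = \delta_{ik} X_{l,j}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, o, N = 3, 2, 5                      # arbitrary small dimensions
X = rng.normal(size=(d, N))
Y = rng.normal(size=(o, N))
W = rng.normal(size=(o, d))
Yhat = W @ X

# 4th-order tensor T[i,j,k,l] = d(Yhat[i,j]) / d(W[k,l]) = delta_{ik} * X[l,j]
T = np.einsum('ik,lj->ijkl', np.eye(o), X)

# Chain rule: contract grad_Y L (an o x N matrix) against T over (i,j)
grad_Y = Yhat - Y
grad_W_tensor = np.einsum('ij,ijkl->kl', grad_Y, T)

# The contraction collapses to the compact formula (Yhat - Y) X^T
grad_W_formula = grad_Y @ X.T
print(np.allclose(grad_W_tensor, grad_W_formula))  # True
```

So the tensor contraction is well defined; it just happens to simplify to an ordinary matrix product.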

At the same time, I learned from a lecture that the gradient used in the update (e.g. in neural networks) is simply $\nabla_W L = (\hat y - y) X^\top$, which is confusing given the above.
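That compact formula can be checked against a central finite-difference gradient (a minimal NumPy sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, o, N = 3, 2, 5
X = rng.normal(size=(d, N))
Y = rng.normal(size=(o, N))
W = rng.normal(size=(o, d))

def loss(W):
    R = W @ X - Y
    return 0.5 * np.sum(R * R)          # (1/2) ||Y - WX||_F^2

analytic = (W @ X - Y) @ X.T            # the lecture formula

# central finite differences, one entry of W at a time
eps = 1e-6
numeric = np.zeros_like(W)
for k in range(o):
    for l in range(d):
        E = np.zeros_like(W); E[k, l] = eps
        numeric[k, l] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```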

My attempt: We have $y = WX$. For each row of the matrices $y$ and $W$ (indexed by $i$ below) I can write:

$$y_i^\top = W^\top_i X$$

Transposing both sides gives $y_i = X^\top W_i$. This means that $y_i$ depends only on $W_i$, and the Jacobian of $y_i$ with respect to $W_i$ is $X^\top$. But I am not sure how to connect this derivation to the chain rule and the 4th-order tensor operation above.
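The row-wise decoupling is easy to confirm numerically (a small NumPy sketch; `i` is an arbitrary row index):

```python
import numpy as np

rng = np.random.default_rng(2)
d, o, N = 3, 2, 5
X = rng.normal(size=(d, N))
W = rng.normal(size=(o, d))
Y = W @ X

i = 1
y_i = Y[i]        # i-th row of Y, a vector of length N
w_i = W[i]        # i-th row of W, a vector of length d

# y_i = X^T w_i, so the Jacobian of y_i with respect to w_i is X^T (N x d),
# and y_i is unaffected by any other row of W
print(np.allclose(y_i, X.T @ w_i))  # True
```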

Best answer:

$ \def\L{{\lambda}} \def\Y{{\widehat Y}} \def\n{\nabla} \def\o{{\tt1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $An important step to avoid confusion is to adopt a consistent naming convention. Toward that end, I'll use uppercase letters to denote matrices, lowercase letters for column vectors, and Greek letters for scalars.

Using the Frobenius product $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \frob{A}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ A:B &= B:A \;=\; B^T:A^T \\ \LR{AB}:C &= A:\LR{CB^T} \;=\; B:\LR{A^TC} \\ }$$ will avoid silly transposition errors.
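Each of these identities can be verified numerically. The sketch below assumes nothing beyond the definitions above; shapes are arbitrary but conformable:

```python
import numpy as np

def frob(P, Q):
    # Frobenius product  P:Q = sum_ij P_ij Q_ij
    return np.sum(P * Q)

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(2, 4))
C = rng.normal(size=(3, 4))

print(np.isclose(frob(A @ B, C), np.trace((A @ B).T @ C)))    # A:B = Tr(A^T B)   -> True
print(np.isclose(frob(C, C), np.linalg.norm(C, 'fro') ** 2))  # A:A = ||A||_F^2   -> True
print(np.isclose(frob(A @ B, C), frob(A, C @ B.T)))           # (AB):C = A:(CB^T) -> True
print(np.isclose(frob(A @ B, C), frob(B, A.T @ C)))           # (AB):C = B:(A^T C)-> True
```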

Use the above notation to rewrite the loss function and calculate its differential and gradient. $$\eqalign{ \L &= \tfrac12\LR{\Y-Y}:\LR{\Y-Y} \\ d\L &= \LR{\Y-Y}:d\Y \\ \grad{\L}{\Y} &= \LR{\Y-Y} \\ }$$ To obtain the gradient wrt $W,\,$ substitute the differential of $\Y$ before extracting the gradient $$\eqalign{ d\L &= \LR{\Y-Y}:\c{d\Y} \\ &= \LR{\Y-Y}:\CLR{dW\,X} \\ &= \LR{\Y-Y}X^T:dW \\ \grad{\L}{W} &= \LR{\Y-Y}X^T \\ }$$
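The differential step $d\mathcal{L} = (\widehat Y - Y):d\widehat Y$ can itself be checked to first order: for a small perturbation $d\widehat Y$, the exact change in the loss should match the Frobenius product. A minimal NumPy sketch (stepping along the gradient direction so the linear term is nonzero):

```python
import numpy as np

rng = np.random.default_rng(4)
o, N = 2, 5
Y = rng.normal(size=(o, N))
Yhat = rng.normal(size=(o, N))

def loss(Yh):
    return 0.5 * np.sum((Yh - Y) ** 2)

# first-order check: dL ~= (Yhat - Y) : dYhat for a small step dYhat
dYhat = 1e-6 * (Yhat - Y)
dL_exact = loss(Yhat + dYhat) - loss(Yhat)
dL_linear = np.sum((Yhat - Y) * dYhat)   # Frobenius product

print(np.isclose(dL_exact, dL_linear, rtol=1e-4))  # True
```

The leftover discrepancy is the second-order term $\tfrac12\|d\widehat Y\|_F^2$, which vanishes quadratically as the step shrinks.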


As you have discovered, the chain rule can be difficult to apply in Matrix Calculus because it involves higher-order tensors (i.e. matrix-by-vector, vector-by-matrix, and matrix-by-matrix gradients) which are difficult to calculate, awkward to manipulate, and don't fit into standard matrix notation.

Furthermore, these tensors are not particularly interesting. They are merely intermediate quantities that are required by the chain rule.

On the other hand, the differential of a matrix behaves like a matrix. In particular, it can be written using standard matrix notation and it obeys all of the rules of Matrix Algebra.