d(vector)/d(matrix) in neural-network-deep-learning context


When learning about neural networks, you are bound to stumble upon the technique of backpropagation/gradient descent. To briefly set up the scenario we will look at:

$$L = g(Wx+b) = g(a).$$

$L$ is a scalar, $W$ is a matrix of appropriate size, $x$ is the input vector, $b$ is a vector of appropriate size, and $g$ is a function that takes a vector of appropriate size and returns a scalar.

$W$ and $b$ are to be updated in the direction opposite to the gradient, so that $L$ is minimized.
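
For concreteness, here is a minimal NumPy sketch of this setup; the particular sizes and the choice $g(a) = \sum_i a_i^2$ are just illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 4                    # a = Wx + b has length n, x has length m
W = rng.normal(size=(n, m))
b = rng.normal(size=n)
x = rng.normal(size=m)

def g(a):
    # illustrative scalar-valued g; the question leaves g abstract
    return np.sum(a ** 2)

a = W @ x + b                  # vector of length n
L = g(a)                       # scalar loss

# Gradient-descent update: W and b move opposite to the gradient, so
# dL/dW must have the same shape as W, and dL/db the same shape as b.
eta = 0.1
# W -= eta * dL_dW
# b -= eta * dL_db
```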

Many textbooks, lectures, and blogs, including Speech and Language Processing (3rd ed. draft) and CS231n, use notation (see page 5) that looks like this:

$$\frac{\partial L}{\partial V} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial V}$$

Here, $L$ is the loss function defined above, $a$ and $z$ are vectors of appropriate size, and $V$ is a matrix of parameters, like $W$ above. The equation is written so that we can find the derivative of $L$ with respect to "the weights" $V$. From this, it seems only logical to assume that we may try to obtain $\frac{\partial L}{\partial W}$ in the same manner:

$$\frac{\partial L}{\partial W} = \frac{\partial g}{\partial a} \frac{\partial a}{\partial W}$$

But here is the problem I have with both expressions: on the right-hand side, there is a derivative of a vector with respect to a matrix. From a few sources I know that neither the notation nor the multiplication of this Frankenstein is rigorously defined, if it is defined at all.

Since we typically subtract $\eta \frac{\partial L}{\partial W}$ from $W$ directly (so $\frac{\partial L}{\partial W}$ must have the same shape as $W$), we must be following denominator-layout notation as explained here, but I cannot figure out the $\partial a/\partial W$ part. I have looked over as many blog posts as I could, but virtually none of them actually show how the calculation should be done.

So my question is as follows.

  1. Is that kind of notation ($\partial a/\partial W$) even legal?
  2. If it is legal, how should it be evaluated and multiplied in this context? Can you give an example?
  3. Can $\partial a/\partial W$ be evaluated without doing it term by term (evaluating $\partial a/\partial W_{11}$, $\partial a/\partial W_{12}$, ..., and assembling them later in some way)? What about $\partial L/\partial W$?

Thank you in advance.


There are 2 answers below.

BEST ANSWER

When tensors get involved (and $\frac{\partial \text{vector}}{\partial\text{matrix}}$ is a 3d tensor), it is useful to keep in mind that the chain rule actually states

$$ D[f\circ g](x_0) = D[f](g(x_0)) \circ D[g](x_0) $$

That is, the derivative of a composition of functions is the composition of their derivatives. In the context of linear algebra, matrix multiplication "$\cdot$" is essentially the same as the composition "$\circ$" of linear maps, so we tend to write $D[f\circ g] = D[f]\cdot D[g]$ instead. However, tensors can be combined linearly in many different ways, and this notation quickly reaches its limits.
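
To make that tensor concrete in this particular setting, here is a sketch under the assumptions $a = Wx + b$ and scalar-valued $g$; the sizes and the vector $p$ standing in for $\partial L/\partial a$ are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.normal(size=(n, m))
b = rng.normal(size=n)
x = rng.normal(size=m)

a = W @ x + b

# da/dW is a 3-d tensor: T[i, j, k] = d a_i / d W_jk.
# For a = Wx + b this is delta_{ij} * x_k.
T = np.einsum('ij,k->ijk', np.eye(n), x)

# Let dL/da = p (here an arbitrary vector standing in for g'(a)).
p = rng.normal(size=n)

# The chain rule is the contraction of p with the tensor over the index i:
dL_dW = np.einsum('i,ijk->jk', p, T)

# which collapses to the rank-one matrix p x^T:
assert np.allclose(dL_dW, np.outer(p, x))
```

The contraction over $i$ is the "multiplication" the chain-rule notation is hiding; once it is carried out, the 3-d tensor collapses to an ordinary matrix of the same shape as $W$.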

I recommend taking a look at http://www.matrixcalculus.org/ (a CAS engine that can help check formulas).

This paper https://arxiv.org/pdf/1208.0197.pdf may also interest you.

SECOND ANSWER

As you've discovered, the chain rule can be problematic when the intermediate quantities cannot be expressed in matrix notation. But an easy way to solve your problem is to use differentials.

For typing convenience, define the vector
$$\frac{\partial L}{\partial a} = \frac{\partial g}{\partial a} = p$$
Write the differential of $L$ in terms of $a$, then perform a change of variables to $W$:
$$\eqalign{
dL &= p:da \cr
&= p:dW\,x \cr
&= px^T:dW \cr
\frac{\partial L}{\partial W} &= px^T \cr
}$$


In the above derivation, a colon was used to denote the trace/Frobenius inner product, i.e.
$$A:B = {\rm Tr}(A^TB)$$
The cyclic property of the trace allows this product to be rearranged in different ways, e.g.
$$\eqalign{
A:BC &= AC^T:B = B^TA:C \cr
A:B &= A^T:B^T = B:A \cr
}$$
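
A small numerical sketch that checks both the Frobenius-product identities and the final formula $\frac{\partial L}{\partial W} = px^T$ against finite differences; the concrete choice $g(a) = \sum_i a_i^2$ (so that $p = g'(a) = 2a$) and the sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Frobenius inner product A:B = Tr(A^T B) ---
frob = lambda X, Y: np.trace(X.T @ Y)

A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 5))
C = rng.normal(size=(5, 4))
assert np.isclose(frob(A, B @ C), frob(A @ C.T, B))   # A:BC = AC^T:B
assert np.isclose(frob(A, B @ C), frob(B.T @ A, C))   # A:BC = B^T A:C

# --- check dL/dW = p x^T by finite differences ---
n, m = 3, 4
W = rng.normal(size=(n, m))
b = rng.normal(size=n)
x = rng.normal(size=m)

def g(a):                          # assumed concrete g; any smooth scalar g works
    return np.sum(a ** 2)

def L_of(W):
    return g(W @ x + b)

p = 2 * (W @ x + b)                # p = dL/da = g'(a) for this g
analytic = np.outer(p, x)          # the formula p x^T derived above

eps = 1e-6
numeric = np.zeros_like(W)
for j in range(n):
    for k in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        numeric[j, k] = (L_of(Wp) - L_of(Wm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```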