Matrix derivatives, problem with dimensions


I'm trying to find the derivative of the function $$L = f \cdot y, \quad f = X \cdot W + b$$

Matrix shapes: $X$ is $(1, m)$, $W$ is $(m, 10)$, $b$ is $(1, 10)$, and $y$ is $(10, 1)$. I'm looking for $\frac{\partial L}{\partial W}$.

According to the chain rule: $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial W} $$

Separately we can find: $$ \frac{\partial L}{\partial f} = y$$ $$ \frac{\partial f}{\partial W} = X$$

The problem is that, according to my formula, $\frac{\partial L}{\partial W}$ has dimension $(10, m)$. However, it should have the same dimension as $W$, namely $(m, 10)$.

Also, I was advised to find the differential of $L$:

$$ d(L) = d(f \cdot y) = d(f) \cdot y = d (X \cdot W + b)y = X \cdot dW \cdot y $$

But I do not understand how I can get the derivative $\frac{\partial L}{\partial W}$ from this.

There are 1 best solutions below

Let's use a convention where a lowercase Latin letter always represents a column vector, an uppercase Latin is a matrix, and a Greek letter is a scalar.

Using this convention, your equations are $$\eqalign{ f &= W^Tx + b \cr \lambda &= f^Ty \cr }$$

As you have noted, the differential of the scalar function is $$\eqalign{ d\lambda &= df^Ty = (dW^Tx)^Ty = x^TdW\,y \cr }$$

Let's develop that a bit further by introducing the trace function. A scalar equals its own trace, and the trace is invariant under cyclic permutation of its arguments, so $$\eqalign{ d\lambda &= {\rm Tr}(x^TdW\,y) = {\rm Tr}(yx^TdW) \cr }$$

Then, depending on your preferred layout convention, the gradient is either $$\eqalign{ \frac{\partial\lambda}{\partial W} &=yx^T \quad{\rm or}\quad xy^T \cr }$$

Since you expected the dimensions of the gradient to be those of $W$, it sounds like your preferred layout is $xy^T$, which has shape $(m, 10)$.
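If you want to convince yourself that $xy^T$ is the right answer in the layout that matches $W$, a finite-difference check is a quick sanity test. The following is only a minimal sketch in plain Python: the size `m = 4`, the random values, and the helper name `scalar_loss` are assumptions made for illustration and are not part of the question.

```python
import random

random.seed(0)
m, n = 4, 10  # m is an arbitrary choice; n = 10 matches the question

# Column vectors are stored as plain lists; W is an m-by-n list of rows.
x = [random.uniform(-1, 1) for _ in range(m)]
y = [random.uniform(-1, 1) for _ in range(n)]
b = [random.uniform(-1, 1) for _ in range(n)]
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]

def scalar_loss(W):
    """Return lambda = f^T y, where f = W^T x + b."""
    f = [sum(W[i][j] * x[i] for i in range(m)) + b[j] for j in range(n)]
    return sum(f[j] * y[j] for j in range(n))

# Closed form from the derivation: entry (i, j) of x y^T is x_i * y_j.
analytic = [[x[i] * y[j] for j in range(n)] for i in range(m)]

# Central finite differences, perturbing one entry of W at a time.
eps = 1e-6
numeric = [[0.0] * n for _ in range(m)]
for i in range(m):
    for j in range(n):
        W[i][j] += eps
        up = scalar_loss(W)
        W[i][j] -= 2 * eps
        down = scalar_loss(W)
        W[i][j] += eps  # restore the original entry
        numeric[i][j] = (up - down) / (2 * eps)

# Largest entrywise gap between the closed form and the numerical estimate.
max_error = max(abs(analytic[i][j] - numeric[i][j])
                for i in range(m) for j in range(n))
print(max_error)
```

Because $\lambda$ is linear in $W$, the central difference should agree with $xy^T$ up to floating-point rounding, and `analytic` has the same $m \times 10$ layout as `W`.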

Also note that $\frac{\partial f}{\partial W}\neq X.\,$ The gradient of a vector with respect to a matrix is a 3rd order tensor, while $X$ is just a 2nd order tensor (aka a matrix). The presence of these 3rd and 4th order tensors as intermediate quantities in the chain rule can make it difficult or impossible to use in practice.
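To make the tensor remark concrete, here is a small sketch that builds $\frac{\partial f}{\partial W}$ explicitly. Writing $f_k = \sum_i W_{ik}x_i + b_k$ gives $\frac{\partial f_k}{\partial W_{ij}} = x_i\,\delta_{kj}$, which needs three indices. The index ordering `T[k][i][j]`, the size `m = 3`, and the sample values are assumptions chosen for illustration.

```python
m, n = 3, 10  # m is an arbitrary choice; n = 10 matches the question
x = [float(i + 1) for i in range(m)]
y = [0.1 * (k + 1) for k in range(n)]

# T[k][i][j] = d f_k / d W_ij = x_i when j == k, and 0 otherwise.
T = [[[x[i] if j == k else 0.0 for j in range(n)]
      for i in range(m)]
     for k in range(n)]
shape = (len(T), len(T[0]), len(T[0][0]))
print(shape)  # three indices, so this is not a matrix like X

# Chain rule: contract d(lambda)/d f_k = y_k against the first index of T.
grad = [[sum(y[k] * T[k][i][j] for k in range(n)) for j in range(n)]
        for i in range(m)]

# The contraction collapses to the outer product x y^T.
outer = [[x[i] * y[j] for j in range(n)] for i in range(m)]
max_gap = max(abs(grad[i][j] - outer[i][j])
              for i in range(m) for j in range(n))
print(max_gap)
```

Most entries of `T` are zero, which hints at why carrying the full tensor through the chain rule is wasteful compared with working directly with the differential.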

The differential approach suggested by your advisor is often simpler because the differential of a matrix is just another matrix quantity, which is easy to handle.