Gradient $\frac{dL}{dX}$ using the chain rule
With the chain rule, $\frac{dL}{dX} = \frac{dY}{dX} \cdot \frac{dL}{dY}$, and $\frac{dY}{dX} = W$ for the product $Y = X \cdot W$.
Q1
I suppose I need to transpose $\frac{dY}{dX}$ into $W^\intercal$ to match the shapes. For instance, if the shape of X is `(3,)` in NumPy, the last axis of shape($\frac{dY}{dX}$) needs to be 3 (so that $\frac{dY}{dX} \cdot dX^\intercal \rightarrow dY$ : (m, 3) • (3, n) → (m, n)?)
However, I am not sure whether this is correct or why, so I would appreciate any explanations.
Q2
How can I apply the chain rule formula to matrices?
$\frac{dL}{dX} = W^\intercal \cdot \frac{dL}{dY}$
This cannot be calculated because of the shape mismatch: $W^\intercal$ is (4, 3) and $\frac{dL}{dY}$ is (4,).
Likewise with $\frac{dL}{dW} = X^\intercal \cdot \frac{dL}{dY}$, because $X^\intercal$ is (3,) and $\frac{dL}{dY}$ is (4,).
What thinking, rationale, or transformation can I apply to get over this?
There are typos in the diagram: `(,4)` should be `(4,)`, etc. In my head, a 1D array of 4 elements was `(,4)`, but in NumPy it is `(4,)`.
For $\frac{dL}{dX}$
I saw that the answer swaps the positions, but I had no idea where this came from or why.
$\frac{dL}{dX} = \frac{dL}{dY} \cdot W^\intercal$
Instead of:
$\frac{dL}{dX} = W^\intercal \cdot \frac{dL}{dY}$
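The swapped form can be checked quickly in NumPy. This is a minimal sketch; the concrete shapes `(3,)`, `(3, 4)`, `(4,)` are my assumptions matching the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(3)        # input, shape (3,)
W = rng.standard_normal((3, 4))   # weights, shape (3, 4)
Y = X @ W                         # output, shape (4,)

dL_dY = rng.standard_normal(4)    # upstream gradient, shape (4,)

# dL/dX = dL/dY @ W^T : (4,) @ (4, 3) -> (3,)
dL_dX = dL_dY @ W.T
assert dL_dX.shape == (3,)

# For 1-D arrays, W @ dL_dY happens to give the same numbers,
# but only dL_dY @ W.T generalizes to a batched dL/dY of shape (n, 4).
assert np.allclose(dL_dX, W @ dL_dY)
```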
For $\frac{dL}{dW}$
The shapes of $X^\intercal$ `(3,)` and $\frac{dL}{dY}$ `(4,)` need to be transformed into (3, 1) and (1, 4) to make the shapes match, but I had no idea where this came from or why.
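The reshape for $\frac{dL}{dW}$ can be sketched like this (same assumed shapes as above, a column times a row giving an outer product):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(3)       # input, shape (3,)
dL_dY = rng.standard_normal(4)   # upstream gradient, shape (4,)

# Reshape the 1-D arrays into a column (3, 1) and a row (1, 4),
# then matrix-multiply: (3, 1) @ (1, 4) -> (3, 4), matching W.
dL_dW = X.reshape(3, 1) @ dL_dY.reshape(1, 4)
assert dL_dW.shape == (3, 4)

# np.outer performs the same column-times-row product implicitly.
assert np.allclose(dL_dW, np.outer(X, dL_dY))
```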
Geometry
In my understanding, X•W geometrically extracts the $\vec{\mathbf{W}}$-dimension part of X by truncating the other dimensions of X. If so, do $\frac{dL}{dX}$ and $\frac{dL}{dW}$ restore the truncated dimensions? I am not sure this is correct, but if so, would it be possible to visualize it like the X•W projection in the diagram?



Thanks to @Reti43 for pointing to the reference. The detailed math is provided by Justin Johnson of cs231n (now at the University of Michigan) at http://cs231n.stanford.edu/handouts/linear-backprop.pdf, which is also available as Backpropagation for a Linear Layer.
cs231n lecture 4 explains the idea.
The math calculation from step (5) to (6) seems to be a leap, because a dot product would not result from two 2D matrices, and `numpy.dot` would perform matrix multiplication like `np.matmul`, hence it would not be a mathematical dot product. The answer in *numpy function to use for mathematical dot product to produce scalar* addressed a way to do this.
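A quick sketch of this point, showing that `np.dot` on 2-D arrays is matrix multiplication, while a scalar dot product of two same-shape matrices is the element-wise product summed:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)

# On 2-D arrays, np.dot performs matrix multiplication, same as np.matmul / @.
assert np.array_equal(np.dot(A, B), A @ B)

# A mathematical (scalar) dot product of two same-shape matrices is the
# element-wise product summed over all entries.
C = np.arange(6).reshape(2, 3)
scalar = np.sum(A * C)

# np.tensordot with axes=2 contracts both axes and gives the same scalar.
assert scalar == np.tensordot(A, C, axes=2)
```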
Format of the Weight vector W
Need to note the representation of the weights `W` by Justin Johnson.

In the Coursera ML course, Andrew Ng uses a row vector to capture the weights of a node. When the number of features input to a layer is `n`, the row vector size is `n`.

Justin Johnson uses a row vector to represent the layer size, i.e. the number of nodes in a layer. Hence if there are `m` nodes in a layer, the row vector size is `m`.

Hence the weight matrix for Andrew Ng is `m x n`, meaning m rows of weight vectors, each of which holds the weights for the `n` features of a specific node. The weight matrix for Justin Johnson is `n x m`, meaning n rows of weight vectors, each of which holds the weights for the `m` nodes in a layer, per feature.

I suppose Justin Johnson regards *a layer as a function*, whereas Andrew Ng regards *a node as a function*. As I studied Andrew Ng's ML course first, I am using the *weight vector per node* approach, which results in `W` as an `m x n` matrix. My confusion came from applying `W = m x n` to Justin Johnson's paper.

Understanding
My understanding from reading through Justin Johnson's paper is below.
Dimension analysis
First frame the dimensions/shapes of the gradients.
Derive the gradient
Using a simple single input record `X` of shape `(d,)`, derive `dL/dX` and extend it to a two-dimensional input `X` of shape `(n, d)`, resulting in `dL/dY @ W`. This is different from `dL/dy @ W.T` in Justin Johnson's paper, because of the difference in the weight matrix W representation.

If something is incorrect, I would very much appreciate any feedback.
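The convention difference can be sketched numerically. This is a minimal sketch under assumed shapes: with Johnson's `W` of shape `(d, m)` and `Y = X @ W`, the gradient is `dL/dY @ W.T`; with the per-node layout, `W` is `(m, d)`, `Y = X @ W.T`, and the gradient is `dL/dY @ W`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 2, 3, 4
X = rng.standard_normal((n, d))
dL_dY = rng.standard_normal((n, m))    # assumed upstream gradient

# Johnson's layout: W is (d, m), Y = X @ W, so dL/dX = dL/dY @ W.T
W_johnson = rng.standard_normal((d, m))
dL_dX_johnson = dL_dY @ W_johnson.T    # (n, m) @ (m, d) -> (n, d)

# Per-node layout (Andrew Ng): W is (m, d), Y = X @ W.T, so dL/dX = dL/dY @ W
W_ng = W_johnson.T                     # same weights, transposed layout
dL_dX_ng = dL_dY @ W_ng                # (n, m) @ (m, d) -> (n, d)

# Same mathematics, different matrix layout: the results agree.
assert np.allclose(dL_dX_johnson, dL_dX_ng)
```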