Does this calculus argument involving rank-4 tensors make sense?


Edit: Completely rewritten to be shorter and easier to digest.

Background and the Actual Question: I'm trying to derive a gradient formula (backpropagation) for a machine learning application. The function is a scalar function (the cost) of a matrix (the weights), but along the way I get a derivative of a matrix with respect to a matrix, and I don't understand how to deal with this. My math curriculum did not cover 4-dimensional matrices!

Below I'll use the language of tensors to avoid confusion, but I don't really know how they work beyond rank two.

I'm hoping the derivation will be pretty simple, since the function itself is simple (just a chain of linear transformations and scalar functions applied elementwise). I can work it out "symbolically" as though everything made sense, but I end up with rank-4 tensors and so on, and I don't know if it makes sense, or can make sense. I'll only write some of the steps.

Derivation: The total cost $C$ is $C=1^T C(\hat y) 1$, where $\hat y$ is an $m\times n_l$ matrix of final predictions, $C(\cdot)$ is applied elementwise, and the $1$s are vectors of ones of the appropriate lengths (so the expression just adds up all the individual costs). I want to find the derivative of $C$ with respect to $w$, a matrix of weights. $\hat y$ is a function of $w$.
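As a sanity check on the notation (with hypothetical sizes $m=3$, $n_l=2$), the sandwich $1^T M 1$ really is just the sum of all entries of $M$:

```python
import numpy as np

# Hypothetical sizes: m = 3 samples, n_l = 2 outputs.
m, n_l = 3, 2
rng = np.random.default_rng(0)
cost_matrix = rng.random((m, n_l))  # stands in for C(y_hat), elementwise costs

ones_m = np.ones(m)
ones_n = np.ones(n_l)

total = ones_m @ cost_matrix @ ones_n  # 1^T C(y_hat) 1
assert np.isclose(total, cost_matrix.sum())
```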

Step 1: $\dfrac{\partial C}{\partial w} = 1^T \dfrac{\partial C(\hat y)}{\partial w} 1$ by linearity, but the inside fraction is now a rank-4 tensor (the derivative of each element of $C(\hat y)$ with respect to each element of $w$). Is this sensible?
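This step does make sense numerically: if $T_{ijkl} = \partial C(\hat y)_{ij}/\partial w_{kl}$ is the rank-4 tensor of partials, contracting the $\hat y$-indices $(i,j)$ with the ones vectors leaves a matrix shaped like $w$, which is exactly what a gradient with respect to $w$ should look like. A sketch with hypothetical sizes:

```python
import numpy as np

# Hypothetical sizes: y_hat is m x n_l, w is p x q.
m, n_l, p, q = 3, 2, 4, 5
rng = np.random.default_rng(1)

# T[i, j, k, l] = d C(y_hat)[i, j] / d w[k, l]  (rank-4 tensor of partials)
T = rng.random((m, n_l, p, q))

# Contract the y_hat-indices (i, j) with the ones vectors;
# what remains is a p x q matrix, the same shape as w.
grad = np.einsum('i,ijkl,j->kl', np.ones(m), T, np.ones(n_l))
assert grad.shape == (p, q)
assert np.allclose(grad, T.sum(axis=(0, 1)))
```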

Step 2: $\dfrac{\partial C(\hat y)}{\partial w}=C'(\hat y)\odot \dfrac{\partial \hat y}{\partial w}$, where $\odot$ is Hadamard multiplication. This makes sense symbolically via the chain rule, but the dimensions don't work out -- what does it mean to multiply a rank-2 tensor by a rank-4 tensor? What we seem to need is to scale every element of the rank-4 tensor by the element of the rank-2 tensor that matches it on the $\hat y$-indices, but I don't know if this is a "normal" operation.
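The operation described here is a standard one in array libraries: it is exactly what NumPy broadcasting does if the rank-2 factor is given two trailing singleton axes. A minimal sketch, assuming $C'(\hat y)$ has shape $(m, n_l)$ and $\partial\hat y/\partial w$ has shape $(m, n_l, p, q)$:

```python
import numpy as np

# Hypothetical sizes: y_hat is m x n_l, w is p x q.
m, n_l, p, q = 3, 2, 4, 5
rng = np.random.default_rng(2)
Cprime = rng.random((m, n_l))        # C'(y_hat), rank 2
dy_dw = rng.random((m, n_l, p, q))   # d y_hat / d w, rank 4

# Scale each element of the rank-4 tensor by the rank-2 element that
# matches it on the y_hat-indices (i, j); broadcasting does exactly this.
result = Cprime[:, :, None, None] * dy_dw
assert result.shape == (m, n_l, p, q)
assert np.isclose(result[1, 0, 2, 3], Cprime[1, 0] * dy_dw[1, 0, 2, 3])
```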

[some steps omitted here]

Step 3: Let $z=x\omega+1_m b$, where $1_m$ is a vector of ones, so $x\omega$ and $1_m b$ are matrices of the same dimension as $z$, and everything but $x$ is constant with respect to $w$. Then $\dfrac{\partial z}{\partial w}=\dfrac{\partial x}{\partial w}\omega$ by linearity, but here we're multiplying a rank-4 tensor by a rank-2 tensor, and I'm honestly not sure how that should be interpreted.
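In index notation this product is unambiguous: from $z_{ij}=\sum_c x_{ic}\,\omega_{cj}$ we get $\dfrac{\partial z_{ij}}{\partial w_{kl}}=\sum_c \dfrac{\partial x_{ic}}{\partial w_{kl}}\,\omega_{cj}$, i.e. an ordinary matrix product on the $x$-indices that leaves the $w$-indices untouched. A sketch with hypothetical sizes:

```python
import numpy as np

# Hypothetical sizes: x is m x d, omega is d x n, w is p x q.
m, d, n, p, q = 3, 4, 2, 5, 6
rng = np.random.default_rng(3)
dx_dw = rng.random((m, d, p, q))  # d x[i, c] / d w[k, l], rank 4
omega = rng.random((d, n))

# dz_dw[i, j, k, l] = sum_c dx_dw[i, c, k, l] * omega[c, j]:
# a matrix product on the x-indices, with the w-indices along for the ride.
dz_dw = np.einsum('ickl,cj->ijkl', dx_dw, omega)
assert dz_dw.shape == (m, n, p, q)
assert np.isclose(dz_dw[0, 1, 2, 3], dx_dw[0, :, 2, 3] @ omega[:, 1])
```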

[the rest of the steps are repeats of the above, or essentially similar]

What I've tried (besides the derivation above): It seems important that in the original cost function, we pretended that $1^T$ and $1$ are rank-1 tensors (vectors), but it makes more sense to call them rank-2 tensors (matrices), where one of the dimensions has length 1. Then $C$ is not really a constant but a $1\times 1$ matrix. This sort of logic can make the entire thing a chain of rank-4 tensors, which is consistent (but not obviously helpful?).

I've also looked at some software packages to see how they deal with this, to see if it can be made natural. Matlab supports higher-dimensional arrays, but there is no native multiplication for them, so this seems unsupported. NumPy supports higher-dimensional arrays and can multiply them with .dot, but on N-dimensional arrays .dot contracts the last axis of the first argument with the second-to-last axis of the second, which doesn't obviously match the operations above, and the Hadamard-style scaling in Step 2 does not seem to be supported as a named operation.
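For what it's worth, the N-dimensional behavior of np.dot can be checked directly; np.einsum is the more general tool, since it lets you spell out any contraction by index, including the ones appearing in the steps above. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((2, 3, 4))
B = rng.random((5, 4, 6))

# For N-D arrays, np.dot contracts the last axis of A with the
# second-to-last axis of B, producing an array of shape (2, 3, 5, 6).
C = np.dot(A, B)
assert C.shape == (2, 3, 5, 6)

# einsum expresses the same contraction explicitly, index by index.
assert np.allclose(C, np.einsum('ijk,lkm->ijlm', A, B))
```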

Finally, I've searched for "high-dimensional matrices," and physicists and machine learning people use the word "tensor" for this situation, but searching for tensors turns up the abstract mathematical concept, which seems theoretically related but not useful for solving this problem. In any case this should be easy, but it hasn't been addressed in any relevant class I've taken or any introductory book I've seen.

I would appreciate any help on any of these steps!