Back Prop Algorithm Details

I'm working on a paper in an area I haven't worked in before: a different way of thinking, at a fundamental level, about what neural network architectures are doing. I've derived all the forward-pass formulas, and rather than re-derive the backpropagation algorithm, I thought I'd take a shortcut and use something generic that's already been worked out. The details of the different variants of backpropagation don't matter to me at this stage, but the exact formalism matters a lot.

Wikipedia (https://en.wikipedia.org/wiki/Backpropagation) has the following formula:

$$\nabla_x C = \left(W^1\right)^T \cdot \left(f^1\right)^{\prime} \circ \ldots \circ \left(W^{L-1}\right)^T \cdot \left(f^{L-1}\right)^{\prime} \circ \left(W^{L}\right)^T \cdot \left(f^{L}\right)^{\prime} \circ \nabla_{a^L} C$$

where $C$ is the cost function, $W^i$ is the weight matrix going into layer $i$, $f^i$ is the thresholding function of the nodes at layer $i$ (typically $f^i = f^j\;\forall i,j$), $\left(f^i\right)^{\prime}$ is the derivative of the thresholding function $f^i$, and $a^L$ is the activation vector of layer $L$ (the last layer).
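For concreteness, here is the forward pass I take this notation to imply, $a^l = f^l(W^l a^{l-1})$ with $a^0 = x$, as a minimal NumPy sketch. The layer sizes and the choice of sigmoid for $f^l$ are my own illustrative assumptions, not from the article:

```python
import numpy as np

def sigmoid(z):
    """Assumed thresholding function f^l (purely illustrative)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes: input 3 -> hidden 4 -> output 2 (illustrative only).
rng = np.random.default_rng(0)
W = {1: rng.standard_normal((4, 3)), 2: rng.standard_normal((2, 4))}

def forward(x):
    """Forward pass: a^l = f^l(W^l a^{l-1}), with a^0 = x."""
    activations = {0: x}
    a = x
    for l in (1, 2):
        z = W[l] @ a      # pre-activation vector at layer l
        a = sigmoid(z)    # f^l applied element-wise
        activations[l] = a
    return activations

acts = forward(np.ones(3))
```

Each $a^l$ here is a vector, one component per node in layer $l$, which is the reading my question below hinges on.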

The $\circ$ operation in this formula is identified as the Hadamard (element-wise) product, so that part is simple enough. But what is the $\cdot$ operator? More precisely, what is the type of $\left(f^i\right)^{\prime}$? At first glance it looks like a scalar, since $f^i$ is a map from $\mathbb{R}$ to $\mathbb{R}$, and it's by propagating the derivative into the components of the input that the weight matrices arise in the expression. What I think it actually is, though, is a vector of derivatives, one for each node in layer $i$. In that case, what we seem to have above is a matrix of weights multiplying a vector of nodes, but that calls for neither a Hadamard product nor a dot product. We could instead have a Hadamard product of several vectors, each formed by a matrix-vector multiply, but that still doesn't call for a dot product. Maybe they're just using a dot symbol but mean matrix-vector multiplication?
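To make my reading of the types explicit, here is the backward recursion as I currently understand it, evaluated right-to-left for $L = 2$: $\left(f^l\right)^{\prime}$ is a *vector* of element-wise derivatives, $\circ$ is the Hadamard product, and $\cdot$ is ordinary matrix-vector multiplication. This is a NumPy sketch; the sigmoid, quadratic cost, and layer sizes are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Element-wise derivative of sigmoid: a vector the same shape as z."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed sizes: input 3 -> hidden 4 -> output 2 (illustrative only).
rng = np.random.default_rng(0)
W = {1: rng.standard_normal((4, 3)), 2: rng.standard_normal((2, 4))}

# Forward pass, saving the pre-activations z^l.
x = np.ones(3)
z1 = W[1] @ x;  a1 = sigmoid(z1)
z2 = W[2] @ a1; a2 = sigmoid(z2)

# Quadratic cost C = 0.5 * ||a^L - y||^2, so grad_{a^L} C = a^L - y.
y = np.zeros(2)
grad_aL = a2 - y                                 # vector, shape (2,)

# Reading the formula right-to-left:
v2 = sigmoid_prime(z2) * grad_aL                 # (f^2)' o grad_{a^L} C   (Hadamard)
v1 = sigmoid_prime(z1) * (W[2].T @ v2)           # (f^1)' o ((W^2)^T . v2)
grad_x = W[1].T @ v1                             # (W^1)^T . v1, shape (3,)
```

Under this reading, every $\cdot$ is a matrix-vector multiply and every $\circ$ is an element-wise product of two vectors of matching length; the result $\nabla_x C$ has the shape of the input $x$.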

So I'm confused about the types of these factors, and the operations being done on them.

This all seems to be pretty standard formalism, so if anyone can easily tell me the types and the operations, I'd really appreciate it... :)