Inconsistent multiplication mechanism in backpropagation


When doing backpropagation to update the weights in a neural network, when should I use the dot product and when should I use entry-wise matrix multiplication (the Hadamard product)?

Assume I obtain the following using the chain rule:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial w_2} $$

... and the resulting dimensions are the following, respectively: $$(9, 200) = (9,1) \cdot (9,1) \cdot (200,9). $$

If the dimensions have to match, I have to entry-wise multiply the first two arrays, and then use the dot product between the remaining two factors.

But what is the mathematics behind this? Why can't I use the same multiplication mechanism on all factors?
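To see the shape constraint concretely, here is a minimal NumPy illustration (the array contents are arbitrary; only the shapes matter): two $(9,1)$ arrays cannot be matrix-multiplied, only combined entry-wise, whereas a $(9,1)$ array times a $(1,200)$ array is a valid outer product.

```python
import numpy as np

a = np.ones((9, 1))
b = np.ones((9, 1))

# Entry-wise (Hadamard) product: defined, result stays (9, 1)
print((a * b).shape)

# Ordinary matrix product: undefined for (9,1) @ (9,1),
# since the inner dimensions (1 and 9) do not match
try:
    a @ b
except ValueError:
    print("matmul fails for (9,1) @ (9,1)")

# But (9,1) @ (1,200) is a valid outer product, giving (9,200)
c = np.ones((1, 200))
print((a @ c).shape)
```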


Let the operator '$:=$' indicate the shape of the left hand side.

We have that $\text{input} := (784, 1)$, $\text{target} := (9, 1)$, $w_1 := (200, 784)$ and $w_2 := (9, 200)$. Then the shapes become:

$$h_1 = w_1 \cdot \text{input} := (200, 1)$$
$$z_1 = \text{sigmoid}(h_1) := (200, 1)$$
$$h_2 = w_2 \cdot z_1 := (9, 1)$$
$$z_2 = \text{sigmoid}(h_2) := (9, 1)$$
$$L = \sum_{i=1}^9(\text{target}^i - z_2^i)^2 := (1,1)$$
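The forward pass above can be sketched directly in NumPy; the weights and inputs are drawn randomly here purely to check that the shapes come out as stated:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Shapes as defined above (contents arbitrary, for illustration only)
inp = rng.standard_normal((784, 1))
target = rng.standard_normal((9, 1))
w1 = rng.standard_normal((200, 784))
w2 = rng.standard_normal((9, 200))

h1 = w1 @ inp                    # (200, 1)
z1 = sigmoid(h1)                 # (200, 1)
h2 = w2 @ z1                     # (9, 1)
z2 = sigmoid(h2)                 # (9, 1)
L = np.sum((target - z2) ** 2)   # scalar
print(h1.shape, z1.shape, h2.shape, z2.shape, np.isscalar(float(L)))
```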

1 Answer

I always prefer the convention that the gradient of a scalar with respect to a matrix has the same shape as that matrix (this is called denominator layout; see the Wikipedia article on matrix calculus). So $\frac{\partial L}{\partial \mathbf{W}_2}$ should have the shape of $\mathbf{W}_2$, i.e. $(9, 200)$. In your case, the gradient is $$ \frac{\partial L}{\partial \mathbf{W}_2}= \left[ (\mathbf{z}_2-\mathbf{t}) \circ \sigma'(\mathbf{h}_2) \right] \mathbf{z}_1^T $$ Note: the loss function was multiplied by $1/2$ here, and $\circ$ denotes the Hadamard product. The first bracket combines two $(9,1)$ vectors entry-wise, giving a $(9,1)$ vector, and the outer product with $\mathbf{z}_1^T$, of shape $(1, 200)$, then yields the required $(9, 200)$ gradient.
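This gradient can be sketched in NumPy as follows (weights and inputs are random placeholders; $\sigma'(h) = \sigma(h)(1-\sigma(h))$ for the sigmoid): the entry-wise derivatives are combined with `*` (Hadamard), and the final step is an ordinary matrix product with `z1.T`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
inp = rng.standard_normal((784, 1))
target = rng.standard_normal((9, 1))
w1 = rng.standard_normal((200, 784))
w2 = rng.standard_normal((9, 200))

# Forward pass
z1 = sigmoid(w1 @ inp)        # (200, 1)
z2 = sigmoid(w2 @ z1)         # (9, 1)

# Backward pass for L = (1/2) * sum((target - z2)**2)
delta2 = (z2 - target) * (z2 * (1 - z2))   # Hadamard product: (9, 1)
grad_w2 = delta2 @ z1.T                    # outer product:    (9, 200)
print(grad_w2.shape)
```

A finite-difference check on a single entry of `w2` confirms that this expression is the true derivative of the (halved) loss.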