Calculating derivatives with respect to a weight matrix


I am currently doing a machine learning course and I am trying to wrap my head around back propagation. I watched these videos which helped clear things up a bit. I am trying to apply the same concepts to this neural network with just one training example:
Neural Network Diagram

This means we have:
$x = a^{(1)} = \begin{bmatrix}0.3\\0.9\end{bmatrix}$
$W^{(1)} = \begin{bmatrix}0.1&&0.8\\0.4&&0.6\end{bmatrix}$
$W^{(2)} = \begin{bmatrix}0.3&&0.9\end{bmatrix}$

We are also given the following as part of the question:
$y = 1$
$\eta = 1$ (the learning rate)
$C = \frac{1}{2}\sum_{n=1}^{N} (\hat{y} - y)^2$ (the loss function)
$sigmoid(x) = \frac{1}{1+e^{-x}}$ (the activation function)

Using the logic in that video, for the second layer we get:
$z^{(2)} = W^{(1)}x = \begin{bmatrix}0.75\\0.66\end{bmatrix}$
$a^{(2)} = sigmoid(z^{(2)}) = sigmoid(\begin{bmatrix}0.75\\0.66\end{bmatrix}) = \begin{bmatrix}0.679\\0.659\end{bmatrix}$

And we get the following for the third layer:
$z^{(3)} = W^{(2)}a^{(2)} = \begin{bmatrix}0.3&&0.9\end{bmatrix}\begin{bmatrix}0.679\\0.659\end{bmatrix} = 0.797$
$\hat{y} = a^{(3)} = sigmoid(z^{(3)}) = sigmoid(0.797) = 0.689$
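As a sanity check (this code is my own, not part of the original question), the forward pass in NumPy reproduces these numbers:

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid, sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# values from the question
x  = np.array([0.3, 0.9])                # a^(1)
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])  # W^(1)
W2 = np.array([[0.3, 0.9]])              # W^(2)

z2 = W1 @ x             # [0.75, 0.66]
a2 = sigmoid(z2)        # approx [0.679, 0.659]
z3 = (W2 @ a2).item()   # approx 0.797
y_hat = sigmoid(z3)     # approx 0.689
```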

I think I can figure out my mistake if I can find the issue with my update to the weights connecting the inputs to the second layer, so that's what I'll focus on. Below is my derivation of the derivative of the loss function with respect to those weights. Note: since this question only has one training example, I have dropped the summation in the derivative.
$\frac{\partial{C}}{\partial{W^{(1)}}} = \frac{\partial}{\partial{W^{(1)}}}[\frac{1}{2}(\hat{y} - y)^2]$
$\frac{\partial{C}}{\partial{W^{(1)}}} = \frac{\partial}{\partial{\hat{y}}}[\frac{1}{2}(\hat{y} - y)^2]\cdot\frac{\partial}{\partial{W^{(1)}}}[sigmoid(z^{(3)})]$
$\frac{\partial{C}}{\partial{W^{(1)}}} = (\hat{y} - y)\cdot\frac{\partial}{\partial{z^{(3)}}}[sigmoid(z^{(3)})]\cdot\frac{\partial}{\partial{W^{(1)}}}(W^{(2)}a^{(2)})$
$\frac{\partial{C}}{\partial{W^{(1)}}} = (\hat{y} - y)\cdot sigmoid'(z^{(3)})\cdot\frac{\partial}{\partial{a^{(2)}}}(W^{(2)}a^{(2)})\cdot\frac{\partial}{\partial{W^{(1)}}}[sigmoid(z^{(2)})]$
$\frac{\partial{C}}{\partial{W^{(1)}}} = (\hat{y} - y)\cdot sigmoid'(z^{(3)})\cdot W^{(2)}\cdot\frac{\partial}{\partial{z^{(2)}}}[sigmoid(z^{(2)})]\cdot\frac{\partial}{\partial{W^{(1)}}}(W^{(1)}x)$
$\frac{\partial{C}}{\partial{W^{(1)}}} = (\hat{y} - y)\cdot sigmoid'(z^{(3)})\cdot W^{(2)}\cdot sigmoid'(z^{(2)})\cdot\begin{bmatrix}x^T\\x^T\end{bmatrix}$

I have calculated $sigmoid'(x)$ to be $\frac{e^{-x}}{(1+e^{-x})^2}$
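As a quick numerical check (my own code, not part of the question), this expression for the derivative agrees with the more common factored form $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # e^{-z} / (1 + e^{-z})^2, as derived above
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

# same derivative in the form usually used during backprop:
# sigma'(z) = sigma(z) * (1 - sigma(z))
z = np.linspace(-5.0, 5.0, 11)
assert np.allclose(sigmoid_prime(z), sigmoid(z) * (1.0 - sigmoid(z)))
```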

You can see how I arrived at the answer for $\frac{\partial}{\partial{W^{(1)}}}(W^{(1)}x)$ here: Derivative of W1*x

Did I do my derivation correctly? If I plug in the numbers I get the following for the new weight matrix (which is incorrect):
$W^{(1)}_{new} = W^{(1)} - \eta (\hat{y} - y)\cdot sigmoid'(z^{(3)})\cdot W^{(2)}\cdot sigmoid'(z^{(2)})\cdot\begin{bmatrix}x^T\\x^T\end{bmatrix}$
$W^{(1)}_{new} = \begin{bmatrix}0.1&&0.8\\0.4&&0.6\end{bmatrix} - 1\cdot(0.689 - 1)\cdot sigmoid'(0.797)\cdot\begin{bmatrix}0.3&&0.9\end{bmatrix}\cdot sigmoid'(\begin{bmatrix}0.75\\0.66\end{bmatrix})\cdot\begin{bmatrix}0.3&&0.9\\0.3&&0.9\end{bmatrix}$
$W^{(1)}_{new} = \begin{bmatrix}0.1&&0.8\\0.4&&0.6\end{bmatrix} + 0.066\begin{bmatrix}0.08&&0.241\\0.08&&0.241\end{bmatrix}$
$W^{(1)}_{new} = \begin{bmatrix}0.105&&0.816\\0.405&&0.616\end{bmatrix}$

The correct answer is:
$W^{(1)}_{new} = \begin{bmatrix}0.101&&0.804\\0.404&&0.612\end{bmatrix}$

Sorry about the super long post, I tried to be as detailed and as obvious as I can with my working so that it is easy to follow and spot my mistake. Any help is greatly appreciated. Thanks in advance!

Best answer:

$\def\s{{\sigma}}\def\p#1#2{\frac{\partial #1}{\partial #2}}\def\D{{\rm Diag}}\def\m#1{\left[\begin{array}{r}#1\end{array}\right]}$A nice way to write the derivative of the logistic sigmoid is
$$\eqalign{
d\s(z) &= \left(S-S^2\right) dz \\
&{\rm where}\quad S = \D\big(\s(z)\big) \;=\; S^T \\
}$$

Then use differentials to do the backprop calculation of the gradient.
$$\eqalign{
C &= \tfrac 12(\hat y-y):(\hat y-y) \\
dC &= (\hat y-y):d\hat y \\
&= (\hat y-y):d\s(W_2a_2) \\
&= (\hat y-y):(S_2-S_2^2)W_2\,da_2 \\
&= W_2^T(S_2-S_2^2)(\hat y-y):da_2 \\
&= W_2^T(S_2-S_2^2)(\hat y-y):d\s(W_1x) \\
&= W_2^T(S_2-S_2^2)(\hat y-y):(S_1-S_1^2)\,dW_1\,x \\
&= (S_1-S_1^2)W_2^T(S_2-S_2^2)(\hat y-y)x^T:dW_1 \\
\p{C}{W_1} &= (S_1-S_1^2)W_2^T(S_2-S_2^2)(\hat y-y)x^T \;\doteq\; G \\
}$$

Use $G$ to perform a gradient descent step. Note that in this particular example $W_2^T=x$ (a numerical coincidence) and $S_2=\hat y$ (since the output layer is a scalar), so
$$\eqalign{
dW_1 &= -\eta G \\
&= -(S_1-S_1^2)x(\hat y-\hat y^2)(\hat y-y)x^T \\
&= (\hat y-\hat y^2)(1-\hat y)\;(S_1-S_1^2)xx^T \\
&= 0.06652\;(S_1-S_1^2)xx^T \\
}$$

Now evaluate the matrix parts of the equation. Note that $S_1=\D\big(\s(W_1x)\big)=\D(a_2)$, i.e. it is built from the activations, not the pre-activations:
$$\eqalign{
&xx^T = \m{0.09&0.27\\0.27&0.81} \quad S_1 = \m{0.679&0\\0&0.659} \\
&(S_1-S_1^2)\,xx^T = \m{0.0196 & 0.0588 \\ 0.0607 & 0.1820} \\
&dW_1 = \m{0.00130 & 0.00391 \\ 0.00403 & 0.01210} \\
&W_1^{new} = W_1+dW_1 = \m{0.10130 & 0.80391 \\ 0.40403 & 0.61210} \\
}$$
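This update can be checked numerically. A minimal NumPy sketch of the same calculation (variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([[0.3], [0.9]])            # column vector
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])
W2 = np.array([[0.3, 0.9]])
y, eta = 1.0, 1.0

# forward pass
a2 = sigmoid(W1 @ x)
y_hat = sigmoid(W2 @ a2)

# backward pass: G = (S1 - S1^2) W2^T (S2 - S2^2) (y_hat - y) x^T,
# written with elementwise products instead of Diag matrices
d3 = (y_hat - y) * y_hat * (1 - y_hat)   # shape (1, 1)
d2 = (W2.T @ d3) * a2 * (1 - a2)         # shape (2, 1)
G  = d2 @ x.T                            # dC/dW1, shape (2, 2)

W1_new = W1 - eta * G
print(np.round(W1_new, 3))  # [[0.101 0.804] [0.404 0.612]], the accepted answer
```

Writing $\D(v)\,M$ as an elementwise product `v * M` avoids building the diagonal matrices explicitly.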


In several steps above, a colon is used as a convenient product notation for the trace (aka the matrix inner product), i.e. $$\eqalign{ A:A &= \big\|A\big\|_F^2 \\ A:B &= {\rm Tr}(A^TB) \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \\ }$$ when $(A,B)$ are vectors this definition corresponds to the ordinary dot product.

The properties of the underlying trace function permit the terms in such a product to be rearranged in a variety of different ways, e.g. $$\eqalign{ A:B &= B:A = B^T:A^T \\ CA:B &= C:BA^T = A:C^TB \\ }$$ Note that the matrix on each side of the colon must have the same dimensions.
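These identities are easy to verify numerically. A small NumPy check (the helper name `colon` is my own, chosen to mirror the notation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 3))

def colon(A, B):
    # A:B = Tr(A^T B), the matrix inner (Frobenius) product
    return np.trace(A.T @ B)

assert np.isclose(colon(A, B), np.sum(A * B))          # A:B = sum_ij A_ij B_ij
assert np.isclose(colon(A, A), np.linalg.norm(A)**2)   # A:A = ||A||_F^2
assert np.isclose(colon(C @ A, B), colon(C, B @ A.T))  # CA:B = C:BA^T
assert np.isclose(colon(C @ A, B), colon(A, C.T @ B))  # CA:B = A:C^T B
```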


I find it helpful to use a more systematic convention wherein $\s_k$ is the activation function, $W_k$ is the weight matrix, $b_k$ is the bias vector, $x_k$ is the input vector and $x_{k+1}$ is the output vector for the $k^{th}$ layer.

Then all of the equations can be condensed to $$x_{k+1} = \s_k\big(W_kx_k + b_k\big)$$ However, the subscripts are rather superfluous (and distracting) so writing this as $$x_+ = \s\big(Wx+b\big)$$ makes it clear that a Neural Network is simply a new way to compute vector iterations.
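That one-line iteration is the entire forward pass. A minimal sketch (my own code; the question's network has no bias terms, so the biases are zero here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b, x, activation=sigmoid):
    # one layer as a vector iteration: x_+ = sigma(W x + b)
    return activation(W @ x + b)

# the question's network is just two applications of the iteration
x  = np.array([0.3, 0.9])
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])
W2 = np.array([[0.3, 0.9]])

y_hat = layer(W2, 0.0, layer(W1, 0.0, x))  # approx 0.689
```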