Can someone please explain this chain rule based derivation to me?


$$ \text{Loss}(y, \hat{y}) = \sum_{i=1}^n \left( y- \hat{y} \right)^2 $$ $$ \begin{split} \frac{\partial \text{Loss}(y, \hat{y})}{\partial W} &= \frac{\partial \text{Loss}(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial W} \quad \text{where}~z = Wx + b \\ & = 2(y-\hat{y}) \cdot \text{derivative of sigmoid function}\cdot x \\ & = 2(y - \hat{y})~ z(1-z)~ x \end{split} $$

Chain rule for calculating derivative of the loss function with respect to the weights

Sigmoid of $x = \frac{1}{1 + e^{-x}}$

Sigmoid derivative of $x = x\cdot (1-x)$

$y$ here is the required (target) output; $\hat{y}$ is the calculated output.

$\hat{y}$ = sigmoid of (input $\times$ weight), where the input is $x$ and the weight is $W$.

Both $\hat{y}$ and $y$ are $1 \times 1$ matrices; the input and weight are also matrices. Can someone please explain the derivation above to me?


There are 2 answers below.


I think the answer should be as follows.

Let $$f(w) = \sum^n_{i=1}(y_i-\hat{y}_i)^2 = \lVert \mathbf{y}-\mathbf{\hat{y}} \rVert^2$$ We know that $\mathbf{\hat{y}} = \frac{1}{1+e^{-Xw}}$, with the sigmoid applied elementwise.

First of all, please correct me if I am wrong, but I think $\frac{\partial f}{\partial \hat{y}} = -2(\mathbf{y}-\mathbf{\hat{y}} )$.

$\frac{\partial \hat{y}}{\partial z} = \frac{\partial (1 + e^{-z})^{-1}}{\partial z} = \frac{e^{-z}}{(1+e^{-z})^2} = \hat{y}\cdot(1-\hat{y})$, where $z = Xw + b$.

$\frac{\partial z}{\partial w} = \frac{\partial (Xw +b)}{\partial w} = x$

$\therefore \frac{\partial f}{\partial w} = -2(y-\hat{y})\cdot (1-\hat{y})\cdot \hat{y} \cdot x$
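As a quick numerical sanity check (my own addition, not part of the derivation above), this closed-form gradient can be compared against a central finite difference in the scalar case $\hat{y} = \sigma(wx)$, dropping the bias $b$ as the formulas above do:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(w, x, y):
    # f(w) = (y - sigmoid(w*x))^2 for a single scalar sample
    return (y - sigmoid(w * x)) ** 2

def grad(w, x, y):
    # df/dw = -2*(y - yhat) * yhat*(1 - yhat) * x
    yhat = sigmoid(w * x)
    return -2.0 * (y - yhat) * yhat * (1.0 - yhat) * x

w, x, y = 0.7, 1.3, 1.0  # arbitrary illustrative values
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
analytic = grad(w, x, y)
print(abs(numeric - analytic) < 1e-8)
```

If the sign or the $\hat{y}(1-\hat{y})$ factor were wrong (e.g. using $z(1-z)$ as in the original picture), this comparison would fail.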

0
On

The linked image uses the wrong derivative for the sigmoid, and is very sloppy in the way it handles vectors.

Let's use a variable naming convention where an uppercase Latin letter is a matrix, a lowercase Latin letter is a vector, and a lowercase Greek letter is a scalar $\ldots$ and without any "hats" or other decorations.

Denote the derivative of the scalar sigmoid (aka logistic) function $\sigma(\zeta)$ as $$\eqalign{ \sigma' = \frac{d\sigma}{d\zeta} = (\sigma-\sigma^2) \\ }$$

When applied elementwise on a vector argument, these functions produce vector values $$s=\sigma(z),\qquad s'=\sigma'(z)$$

In this case, it's more convenient to work with the differential rather than the derivative $$\eqalign{ ds &= s'\odot dz = (s-s\odot s)\odot dz \\ }$$

The $\odot$ symbols represent elementwise/Hadamard products, but these can be eliminated in favor of multiplication by the diagonal matrix $\,S={\rm Diag}(s)$ $$\eqalign{ ds &= \left(S-S^2\right) dz \\ }$$

Define some new variables in accordance with our naming convention. $$\eqalign{ z &= Wx+b \quad&\implies dz = dW\,x \\ s &= \hat y = \sigma(z) &\implies ds = \left(S-S^2\right)dz \\ r &= s-y &\implies dr = ds \\ }$$

Write the loss function in terms of these new variables.
Then calculate its differential and gradient. $$\eqalign{ {\cal L} &= r:r \\ d{\cal L} &= 2r:dr \\ &= 2(s-y):\left(S-S^2\right)dz \\ &= 2\left(S-S^2\right)(s-y):dW\,x \\ &= 2\left(S-S^2\right)(s-y)x^T:dW \\ \frac{\partial{\cal L}}{\partial W} &= 2\left(S-S^2\right)(s-y)x^T \\ }$$ where a colon represents the trace/Frobenius product, i.e. $\;A:B = {\rm Tr}(A^TB)$
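As a numerical sanity check (my own addition, with illustrative dimensions), the gradient $2\left(S-S^2\right)(s-y)x^T$ can be verified against finite differences. Note that multiplying a vector by the diagonal matrix $S-S^2$ is the same as the elementwise product with $s\odot(1-s)$, which is what the code below uses:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4            # output and input dimensions (arbitrary)
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)
y = rng.normal(size=m)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(W):
    # L = r:r with r = s - y, s = sigmoid(Wx + b)
    r = sigmoid(W @ x + b) - y
    return r @ r

# analytic gradient: dL/dW = 2*(S - S^2)(s - y) x^T, S = Diag(s)
s = sigmoid(W @ x + b)
analytic = 2.0 * ((s - s**2) * (s - y))[:, None] * x[None, :]

# central finite differences, one matrix entry at a time
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-7))
```

Because $r = s - y$ (rather than $y - s$), the factor $2$ here carries no minus sign, consistent with the final line of the derivation.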