Can someone please explain this chain rule based derivation to me?


$$ \text{Loss}(y, \hat{y}) = \sum_{i=1}^n \left( y- \hat{y} \right)^2 $$ $$ \begin{split} \frac{\partial \text{Loss}(y, \hat{y})}{\partial W} &= \frac{\partial \text{Loss}(y, \hat{y})}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial W} \quad \text{where}~z = Wx + b \\ & = 2(y-\hat{y}) \cdot \text{derivative of sigmoid function}\cdot x \\ & = 2(y - \hat{y})~ z(1-z)~ x \end{split} $$

Chain rule for calculating derivative of the loss function with respect to the weights

Sigmoid of $x = \frac{1}{1 + e^{-x}}$

Sigmoid derivative of $x = x\cdot (1-x)$

$y$ here is the required (target) output; $\hat{y}$ is the calculated output.

$\hat{y}$ = sigmoid of (input $\times$ weight), where the input is $x$ and the weight is $W$.

Both $\hat{y}$ and $y$ are $1 \times 1$ matrices; the input and weight are also matrices. Can someone please explain the derivation above to me?


There are 2 answers below.


I think the answer should be as follows.

Let $$f(w) = \sum^n_{i=1}(y_i-\hat{y}_i)^2 = \lVert \mathbf{y}-\mathbf{\hat{y}} \rVert^2$$ We know that $\mathbf{\hat{y}} = \frac{1}{1+e^{-Xw}}$, with the sigmoid applied elementwise.

First of all, please correct me if I am wrong, but I think $\frac{\partial f}{\partial \hat{y}} = -2(\mathbf{y}-\mathbf{\hat{y}} )$.

$\frac{\partial \hat{y}}{\partial z} = \frac{\partial (1 + e^{-z})^{-1}}{\partial z} = \frac{e^{-z}}{(1+e^{-z})^2} = \hat{y}\cdot(1-\hat{y})$, where $z = Xw + b$.

$\frac{\partial z}{\partial w} = \frac{\partial (Xw +b)}{\partial w} = x$

$\therefore \frac{\partial f}{\partial w} = -2(y-\hat{y})\cdot (1-\hat{y})\cdot \hat{y} \cdot x$
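As a quick numerical sanity check (my own addition, not part of the derivation above), this closed-form gradient can be compared against a central finite difference in the scalar case $\hat{y} = \sigma(wx)$, dropping the bias $b$ as the formulas above do:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(w, x, y):
    # f(w) = (y - sigmoid(w*x))^2 for a single scalar sample
    return (y - sigmoid(w * x)) ** 2

def grad(w, x, y):
    # df/dw = -2*(y - yhat) * yhat*(1 - yhat) * x
    yhat = sigmoid(w * x)
    return -2.0 * (y - yhat) * yhat * (1.0 - yhat) * x

w, x, y = 0.7, 1.3, 1.0  # arbitrary illustrative values
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
analytic = grad(w, x, y)
print(abs(numeric - analytic) < 1e-8)
```

If the sign or the $\hat{y}(1-\hat{y})$ factor were wrong (e.g. using $z(1-z)$ as in the original picture), this comparison would fail.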

0
On

The linked image uses the wrong derivative for the sigmoid, and is very sloppy in the way it handles vectors.

Let's use a variable naming convention where an uppercase Latin letter is a matrix, a lowercase Latin letter is a vector, and a lowercase Greek letter is a scalar $\ldots$ and without any "hats" or other decorations.

Denote the derivative of the scalar sigmoid (aka logistic) function $\sigma(\zeta)$ as $$\eqalign{ \sigma' = \frac{d\sigma}{d\zeta} = (\sigma-\sigma^2) \\ }$$

When applied elementwise on a vector argument, these functions produce vector values $$s=\sigma(z),\qquad s'=\sigma'(z)$$

In this case, it's more convenient to work with the differential rather than the derivative $$\eqalign{ ds &= s'\odot dz = (s-s\odot s)\odot dz \\ }$$

The $\odot$ symbols represent elementwise/Hadamard products, but these can be eliminated in favor of multiplication by the diagonal matrix $\,S={\rm Diag}(s)$ $$\eqalign{ ds &= \left(S-S^2\right) dz \\ }$$

Define some new variables in accordance with our naming convention. $$\eqalign{ z &= Wx+b \quad&\implies dz = dW\,x \\ s &= \hat y = \sigma(z) &\implies ds = \left(S-S^2\right)dz \\ r &= s-y &\implies dr = ds \\ }$$

Write the loss function in terms of these new variables.
Then calculate its differential and gradient. $$\eqalign{ {\cal L} &= r:r \\ d{\cal L} &= 2r:dr \\ &= 2(s-y):\left(S-S^2\right)dz \\ &= 2\left(S-S^2\right)(s-y):dW\,x \\ &= 2\left(S-S^2\right)(s-y)x^T:dW \\ \frac{\partial{\cal L}}{\partial W} &= 2\left(S-S^2\right)(s-y)x^T \\ }$$ where a colon represents the trace/Frobenius product, i.e. $\;A:B = {\rm Tr}(A^TB)$
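As a numerical sanity check (my own addition, with illustrative dimensions), the gradient $2\left(S-S^2\right)(s-y)x^T$ can be verified against finite differences. Note that multiplying a vector by the diagonal matrix $S-S^2$ is the same as the elementwise product with $s\odot(1-s)$, which is what the code below uses:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4            # output and input dimensions (arbitrary)
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
b = rng.normal(size=m)
y = rng.normal(size=m)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(W):
    # L = r:r with r = s - y, s = sigmoid(Wx + b)
    r = sigmoid(W @ x + b) - y
    return r @ r

# analytic gradient: dL/dW = 2*(S - S^2)(s - y) x^T, S = Diag(s)
s = sigmoid(W @ x + b)
analytic = 2.0 * ((s - s**2) * (s - y))[:, None] * x[None, :]

# central finite differences, one matrix entry at a time
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-7))
```

Because $r = s - y$ (rather than $y - s$), the factor $2$ here carries no minus sign, consistent with the final line of the derivation.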