I'm learning about gradient descent, and I think I've got the general gist of the partial differentiation behind it, but I'm a bit confused by one thing.
When doing:
$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial a} \frac{\partial a}{\partial w} $$
to compute a new weight, should you be differentiating the loss $L$ with respect to $Y$ or with respect to $\hat{Y}$ - i.e. with respect to the true $Y$ value or the predicted one?
Online I've seen it done both ways, producing a partial derivative of MSE that is either negative or positive.
Which should it be?
You need to apply the chain rule. A function with multiple inputs, such as a loss function, is effectively a function of a single tuple-valued input:
$$\ell : \mathbb{R}^n \oplus \mathbb{R}^n \longrightarrow \mathbb{R}, \quad (y, \hat{y}) \longmapsto \ell(y, \hat{y})$$
Now, here, $\hat{y} = \hat{y}(x,w)$ is the model, which depends both on the inputs $x$ and the parameters $w$. Therefore, applying the chain rule strictly by definition, we have:
$$\begin{aligned} \frac{\partial \ell}{\partial w} &= \frac{\partial \ell}{\partial (y,\hat{y})} \cdot \frac{\partial (y,\hat{y})}{\partial w} \\ &= \begin{bmatrix}\frac{\partial \ell}{\partial y} & \frac{\partial \ell}{\partial \hat{y}}\end{bmatrix} \cdot \begin{bmatrix}\frac{\partial y}{\partial w} \\ \frac{\partial \hat{y}}{\partial w}\end{bmatrix} \\ &= \frac{\partial \ell}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial \ell}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w} \end{aligned}$$
But since $y$ does not depend on $w$, the derivative $\frac{\partial y}{\partial w}$ is zero and the first term vanishes, leaving only $\frac{\partial \ell}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial w}$. So you differentiate the loss with respect to the prediction $\hat{y}$; the true values $y$ are constants as far as the weights are concerned.
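A quick numerical sanity check can make this concrete. The sketch below (a hypothetical setup, assuming a one-parameter linear model $\hat{y} = wx$ with MSE loss) computes the gradient via the chain rule, treating $y$ as a constant and differentiating only through $\hat{y}$, and compares it against a finite-difference estimate:

```python
import numpy as np

# Hypothetical setup: 10 random inputs and targets, one scalar weight.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
y = rng.standard_normal(10)  # true targets; they do NOT depend on w
w = 0.5

def loss(w):
    """MSE loss of the linear model y_hat = w * x."""
    y_hat = w * x
    return np.mean((y - y_hat) ** 2)

# Chain rule: dL/dy_hat = -2 (y - y_hat) / n, and dy_hat/dw = x,
# so dL/dw = sum over samples of dL/dy_hat * dy_hat/dw.
y_hat = w * x
grad_analytic = np.sum(-2 * (y - y_hat) / len(y) * x)

# Central finite-difference estimate of dL/dw for comparison.
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(grad_analytic - grad_numeric))  # should be tiny
```

If you instead differentiated with respect to $y$ you would get the opposite sign and the finite-difference check would fail, which is exactly the discrepancy between the two versions seen online.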