I'm currently learning about neural networks and ran into a point of confusion related to the use of Stochastic Gradient Descent (SGD) in training. Specifically, I'm puzzled by the computation of the partial derivative of the cross-entropy loss with respect to the predicted probabilities. Here's where my confusion lies:
Why is it that $ \frac{\partial}{\partial f(\mathbf{x})_c}(-\log f(\mathbf{x})_y) = \frac{-1_{(y=c)}}{f(\mathbf{x})_y} \quad$? ($1_{(y=c)}=1$ if $y=c$, and $0$ otherwise)
Given that $ f(\mathbf{x})_c = p(y=c|\mathbf{x}) $ and that the probabilities across all classes sum to one, $ \sum_c f(\mathbf{x})_c = \sum_c p(y=c|\mathbf{x}) = 1 $, it seems there should be a relationship between the derivatives across different classes. Substituting $ f(\mathbf{x})_y = 1-\sum_{c' \neq y}f(\mathbf{x})_{c'} $, shouldn't the derivative $ \frac{\partial}{\partial f(\mathbf{x})_c}(-\log f(\mathbf{x})_y) $ then be equivalent to $ \frac{\partial}{\partial f(\mathbf{x})_c}\Big(-\log\big(1-\sum_{c' \neq y}f(\mathbf{x})_{c'}\big)\Big)=-\frac{1}{ f(\mathbf{x})_y}\,\frac{\partial}{\partial f(\mathbf{x})_c}\Big(1-\sum_{c' \neq y}f(\mathbf{x})_{c'}\Big) $? For $c\neq y$ the inner derivative equals $-1$, so the whole expression would be nonzero, which seems to contradict the formula above.
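To make the confusion concrete, here is a quick finite-difference sketch (the probability vector `p` and class index `y` are made-up example values) that treats the entries of $f(\mathbf{x})$ as independent variables, which is the reading under which the indicator formula holds:

```python
import numpy as np

# Hypothetical predicted probabilities and true class index (illustrative only).
p = np.array([0.2, 0.5, 0.3])
y = 1

def loss(q):
    # Cross-entropy loss for the true class y: -log f(x)_y
    return -np.log(q[y])

eps = 1e-6
grads = []
for c in range(len(p)):
    # Perturb ONE entry at a time, leaving the others fixed
    # (i.e., ignore the sum-to-one constraint).
    q = p.copy()
    q[c] += eps
    grads.append((loss(q) - loss(p)) / eps)

print(grads)  # approximately [0.0, -1/p[y], 0.0] = [0.0, -2.0, 0.0]
```

Varied independently, only the $c=y$ entry affects the loss, matching $-1_{(y=c)}/f(\mathbf{x})_y$; my confusion is why the constraint among the entries doesn't change this.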
I'm trying to wrap my head around this concept and would greatly appreciate any insights or explanations you might offer. Thank you!