I'm currently learning about neural networks and ran into a point of confusion related to the use of Stochastic Gradient Descent (SGD) in training. Specifically, I'm puzzled by the computation of the partial derivative of the cross-entropy loss with respect to the predicted probabilities. Here's where my confusion lies:
Why is it that $ \frac{\partial}{\partial f(\mathbf{x})_c}(-\log f(\mathbf{x})_y) = \frac{-1_{(y=c)}}{f(\mathbf{x})_y} \quad$? ($1_{(y=c)}=1$ if $y=c$, and $0$ otherwise)
Given that $ f(\mathbf{x})_c = p(y=c|\mathbf{x}) $ and that the probabilities across all classes sum to one, $ \sum_c f(\mathbf{x})_c = \sum_c p(y=c|\mathbf{x}) = 1 $, it seems there should be a relationship between the derivatives across different classes. Substituting $ f(\mathbf{x})_y = 1-\sum_{c' \neq y}f(\mathbf{x})_{c'} $, shouldn't the derivative $ \frac{\partial}{\partial f(\mathbf{x})_c}(-\log f(\mathbf{x})_y) $ then be equivalent to $ \frac{\partial}{\partial f(\mathbf{x})_c}\Big(-\log\big(1-\sum_{c' \neq y}f(\mathbf{x})_{c'}\big)\Big)=-\frac{1}{ f(\mathbf{x})_y}\,\frac{\partial}{\partial f(\mathbf{x})_c}\Big(1-\sum_{c' \neq y}f(\mathbf{x})_{c'}\Big) $? For $c\neq y$ the inner derivative equals $-1$, so the whole expression would be nonzero, which seems to contradict the formula above.
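To make the confusion concrete, here is a quick finite-difference sketch (the probability vector `p` and class index `y` are made-up example values) that treats the entries of $f(\mathbf{x})$ as independent variables, which is the reading under which the indicator formula holds:

```python
import numpy as np

# Hypothetical predicted probabilities and true class index (illustrative only).
p = np.array([0.2, 0.5, 0.3])
y = 1

def loss(q):
    # Cross-entropy loss for the true class y: -log f(x)_y
    return -np.log(q[y])

eps = 1e-6
grads = []
for c in range(len(p)):
    # Perturb ONE entry at a time, leaving the others fixed
    # (i.e., ignore the sum-to-one constraint).
    q = p.copy()
    q[c] += eps
    grads.append((loss(q) - loss(p)) / eps)

print(grads)  # approximately [0.0, -1/p[y], 0.0] = [0.0, -2.0, 0.0]
```

Varied independently, only the $c=y$ entry affects the loss, matching $-1_{(y=c)}/f(\mathbf{x})_y$; my confusion is why the constraint among the entries doesn't change this.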
I'm trying to wrap my head around this concept and would greatly appreciate any insights or explanations you might offer. Thank you!