Consider the cost function for softmax regression (I will use the term multinomial logistic regression):
$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
According to the UFLDL tutorial, the gradient of this function is:
$$ \bigtriangledown_{ \theta^{(k)} }J( \theta ) = -\sum^{m}_{i=1} [x^{(i)} (1 \{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) ) ] $$
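(As a sanity check, I verified the tutorial's formula numerically against finite differences of $J(\theta)$. Here is a quick NumPy sketch on random toy data; all variable names are my own, none come from the tutorial's code. It confirms their gradient expression is correct, so the mistake must be in my derivation below.)

```python
import numpy as np

np.random.seed(0)

# Tiny synthetic problem: m examples, n features, K classes.
m, n, K = 20, 5, 3
X = np.random.randn(m, n)          # x^{(i)} as rows
y = np.random.randint(0, K, m)     # labels y^{(i)} in {0, ..., K-1}
Theta = np.random.randn(K, n)      # theta^{(k)} as rows

def softmax_probs(Theta, X):
    """p(y = k | x; theta) for every example (rows) and class (cols)."""
    Z = X @ Theta.T                      # m x K matrix of inner products
    Z -= Z.max(axis=1, keepdims=True)    # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cost(Theta):
    P = softmax_probs(Theta, X)
    # J = -sum_i log p(y^{(i)} | x^{(i)}): the indicator picks one term per i
    return -np.log(P[np.arange(m), y]).sum()

def grad(Theta):
    P = softmax_probs(Theta, X)
    Y = np.zeros_like(P)
    Y[np.arange(m), y] = 1.0             # one-hot encoding of the indicator
    # The tutorial's formula: -sum_i x^{(i)} (1{y^{(i)}=k} - p_k)
    return -(Y - P).T @ X                # K x n, row k is grad wrt theta^{(k)}

# Central finite-difference check of the analytic gradient.
eps = 1e-6
G = grad(Theta)
G_num = np.zeros_like(Theta)
for idx in np.ndindex(*Theta.shape):
    T_plus, T_minus = Theta.copy(), Theta.copy()
    T_plus[idx] += eps
    T_minus[idx] -= eps
    G_num[idx] = (cost(T_plus) - cost(T_minus)) / (2 * eps)

print(np.max(np.abs(G - G_num)))  # maximum elementwise discrepancy
```

On my machine the discrepancy is negligibly small, so I trust the formula as stated.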
However, the tutorial does not include the derivation. Does anyone know how to derive it?

I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they give.
So I first took the gradient $\bigtriangledown_{ \theta^{(k)} }J( \theta )$ as they suggested:
$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \bigtriangledown_{ \theta^{(k)} } \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
but since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific $k$ will be non-zero when we take derivatives. Hence:
$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \sum^m_{i=1} \bigtriangledown_{ \theta^{(k)} } \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
Then, applying the chain rule to the logarithm, we get:
$$ - \sum^m_{i=1} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \bigtriangledown_{ \theta^{(k)} } p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
However, at this point my expression looks so different from what the UFLDL tutorial has, and the indicator function has disappeared completely, that I suspect I made a mistake somewhere. On top of that, the final gradient contains a difference, but no differences/subtractions appear anywhere in my derivation. I suspect the difference might arise when applying the quotient rule to the softmax probability, but the disappearing indicator function still worries me. Any ideas?