What is the derivation of the derivative of softmax regression (or multinomial logistic regression)?


Consider the training cost for softmax regression (I will use the term multinomial logistic regression):

$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
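For concreteness, the cost can be written as a short numerical sketch. This is only an illustrative layout I am assuming here: `theta` stored as a $K \times d$ matrix whose row $k$ is $\theta^{(k)}$, and labels encoded as integers in $\{0,\dots,K-1\}$:

```python
import numpy as np

def softmax_cost(theta, X, y):
    """J(theta): negative log-likelihood over m examples.

    theta: (K, d) matrix, row k is theta^(k)   (assumed layout)
    X:     (m, d) matrix, row i is x^(i)
    y:     (m,) integer labels in {0, ..., K-1}
    """
    scores = X @ theta.T                          # theta^(k)^T x^(i) for all i, k
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)             # p(y = k | x; theta), row-wise softmax
    m = X.shape[0]
    # the indicator 1{y^(i) = k} just selects the true class's log-probability
    return -np.log(p[np.arange(m), y]).sum()
```

As a quick sanity check: with `theta = 0`, every class gets probability $1/K$, so the cost is $m \log K$.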

According to the UFLDL tutorial, the derivative of the above function is:

$$ \bigtriangledown_{ \theta^{(k)} }J( \theta ) = -\sum^{m}_{i=1} [x^{(i)} (1 \{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) ) ] $$
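Even without the derivation, the stated gradient can be sanity-checked numerically with a finite-difference comparison. This is a minimal sketch under the same assumed layout as before (`theta` as a $K \times d$ matrix whose row $k$ is $\theta^{(k)}$):

```python
import numpy as np

def probs(theta, X):
    s = X @ theta.T
    s -= s.max(axis=1, keepdims=True)        # stabilize exp
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)  # p(y = k | x; theta)

def cost(theta, X, y):
    return -np.log(probs(theta, X)[np.arange(len(y)), y]).sum()

def grad(theta, X, y):
    """Row k is -sum_i x^(i) (1{y^(i)=k} - p(y^(i)=k | x^(i); theta))."""
    p = probs(theta, X)
    ind = np.zeros_like(p)
    ind[np.arange(len(y)), y] = 1.0          # indicator 1{y^(i) = k}
    return -(ind - p).T @ X                  # (K, d) matrix of gradients

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 4, size=6)
theta = rng.normal(size=(4, 3))

# numerically differentiate J with respect to one entry of theta
eps, (k, j) = 1e-6, (2, 1)
tp, tm = theta.copy(), theta.copy()
tp[k, j] += eps
tm[k, j] -= eps
numeric = (cost(tp, X, y) - cost(tm, X, y)) / (2 * eps)
```

The central-difference estimate `numeric` should agree with `grad(theta, X, y)[k, j]` to several decimal places, which at least confirms the formula before attempting the derivation.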

However, they did not include the derivation. Does anyone know how to derive it?

I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they have.

So I first took the gradient $\bigtriangledown_{ \theta^{(k)} }J( \theta )$ as they suggested:

$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \bigtriangledown_{ \theta^{(k)} } \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

but since we are taking the gradient with respect to $\theta^{(k)}$, only the term in the inner sum that matches this specific $k$ is non-zero when we take derivatives. Hence:

$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \sum^m_{i=1} \bigtriangledown_{ \theta^{(k)} } \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

Then, applying the chain rule to the logarithm, we get:

$$ - \sum^m_{i=1} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \bigtriangledown_{ \theta^{(k)} } p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

However, at this point the equation looks so different from what the UFLDL tutorial has, and the indicator function has disappeared completely, that I suspect I made a mistake somewhere. On top of that, the final derivative contains a difference, yet I don't see any subtractions anywhere in my derivation. I suspect a difference might appear when applying the quotient rule, but the disappearing indicator function still worries me. Any ideas?