Say that we want to extend an arbitrary predictor to multi-class classification. Suppose the predictor $h_{\theta}(x)$ outputs a vector with one entry per label, so that one can get the "confidence margin" for each label by indexing the parameters as follows: $h_{\theta_l}(x)$. Using this "confidence" one can get the probability of a label as follows:
$$ p( y = l \mid x ; \theta ) = \frac{ e^{ h_{\theta_l}(x) }}{ \sum^L_{l'=1} e^{ h_{\theta_{l'}}(x) }} = \frac{ e^{ h_{\theta_l}(x) } }{z(x;\theta)}$$
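For concreteness, here is a minimal Python sketch of this probability (here `scores` stands for the vector $(h_{\theta_1}(x), \dots, h_{\theta_L}(x))$ produced by whatever predictor is used; subtracting the max is just the usual numerical-stability trick and does not change the ratio):

```python
import numpy as np

def softmax_probs(scores):
    """Map the confidence margins h_{theta_l}(x), l = 1..L, to p(y = l | x; theta)."""
    shifted = scores - np.max(scores)      # stabilise the exponentials
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()   # the denominator plays the role of z(x; theta)
```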
Armed with this model and notation, our goal is to compute the derivative of the cross-entropy loss:
$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta )$$
with respect to the parameters $\theta$ (below, $t_k$ denotes an arbitrary scalar parameter of $\theta$).
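In code, the loss for a single example reduces to the negative log-probability of the true label (a sketch continuing the snippet above; `y` is assumed to be a 0-based label index):

```python
def cross_entropy_loss(scores, y):
    """J(x, y; theta) = -log p(y | x; theta) for one example with true label index y."""
    probs = softmax_probs(scores)
    return -np.log(probs[y])
```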
I claim that the derivative is as follows:
$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$
and this is the derivative I'd like to check with the community. This is my derivation:
$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta ) = - \sum^L_{l=1} \left[ \mathbb{1}\{ y = l\} h_{\theta_l}(x) - \mathbb{1}\{ y = l\} \log z(x; \theta) \right] $$
$$ = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \left[ \sum^L_{l=1} \mathbb{1}\{ y = l\} \right] \log z(x; \theta) = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \log z(x; \theta) $$
where $\sum^L_{l=1} \mathbb{1}\{ y = l\} = 1$ because the label of any valid data point has to match one and only one of the labels $l$. Now taking the derivative we have:
$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} + \frac{\partial }{\partial \theta }\log z(x; \theta) $$
using the chain rule on $z(x; \theta)$ we get:
$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} + \frac{1}{z(x; \theta)} \sum^L_{l=1} e^{h_{\theta_l}(x)} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} $$
factoring $\frac{ \partial h_{\theta_l}(x)}{\partial \theta}$ and collecting the summations gives:
$$ -\sum^L_{l=1} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} \left[ \mathbb{1}\{ y = l\} - \frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} \right] $$
but we defined $\frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} = p( y = l \mid x ; \theta ) $
so we get my final answer:
$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$
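To make the formula concrete: if the Jacobian $\partial h_{\theta_l}(x) / \partial t_k$ is available as an $L \times K$ array (one row per label, one column per parameter), the whole gradient is one matrix-vector product. A sketch, assuming such a `jacobian` array is supplied by the predictor:

```python
def cross_entropy_grad(scores, y, jacobian):
    """Gradient of J w.r.t. the parameters t_1..t_K, where jacobian[l, k] = d h_{theta_l}(x) / d t_k."""
    probs = softmax_probs(scores)
    one_hot = np.zeros_like(probs)
    one_hot[y] = 1.0
    # dJ/dt_k = -sum_l (dh_l/dt_k) * (1{y=l} - p_l)
    return -jacobian.T @ (one_hot - probs)
```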
It's actually quite nice to be able to confirm that this agrees with what Andrew Ng has when the model is just linear, i.e. when $h_{ \theta_l }(x) = {\theta^{(l)}}^T x$: in that case $\frac{\partial h_{\theta_l}(x)}{\partial \theta^{(k)}} = \mathbb{1}\{ l = k \}\, x$, so only the $l = k$ term of the sum survives and
$$ \frac{\partial J(x,y; \theta)}{\partial \theta^{(k)}} = - x \, [\mathbb{1}\{ y=k\} - p(y=k \mid x; \theta ) ]$$
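Written out for the linear case (a sketch; `Theta` is an $L \times d$ matrix whose $l$-th row is $\theta^{(l)}$, so the $k$-th row of the result is $\partial J / \partial \theta^{(k)}$):

```python
def softmax_regression_grad(Theta, x, y):
    """Gradient of J for the linear model h_{theta_l}(x) = theta^{(l)}.T x."""
    probs = softmax_probs(Theta @ x)
    one_hot = np.zeros_like(probs)
    one_hot[y] = 1.0
    # row k: -x * (1{y=k} - p_k)
    return -np.outer(one_hot - probs, x)
```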
The only reason I somewhat doubt my derivation is that I cannot get code based on it to match my numerical derivatives. I posted a question about my code on Stack Overflow here.
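For reference, this is the kind of finite-difference check I have in mind, written against the linear case above (a minimal sketch; the dimensions and random data are arbitrary, and the helper names are the ones introduced in the earlier snippets, not my actual code):

```python
def numerical_grad(Theta, x, y, eps=1e-6):
    """Central-difference approximation of dJ/dTheta, one entry at a time."""
    grad = np.zeros_like(Theta)
    for idx in np.ndindex(Theta.shape):
        Theta_plus, Theta_minus = Theta.copy(), Theta.copy()
        Theta_plus[idx] += eps
        Theta_minus[idx] -= eps
        grad[idx] = (cross_entropy_loss(Theta_plus @ x, y)
                     - cross_entropy_loss(Theta_minus @ x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
L, d = 4, 3
Theta = rng.normal(size=(L, d))
x = rng.normal(size=d)
y = 2
# the analytic and numerical gradients should agree up to finite-difference error
print(np.max(np.abs(softmax_regression_grad(Theta, x, y) - numerical_grad(Theta, x, y))))
```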