What is the derivative of the cross entropy loss when extending an arbitrary predictor for multi class classification?


Say that we want to extend an arbitrary predictor to multi-class classification. Say that the predictor $h_{\theta}(x)$ outputs a score for each label; in fact one can get the "confidence margin" for each label by indexing the parameters as follows: $h_{\theta_l}(x)$. Using this "confidence" one can get the probability of a label as follows:

$$ p( y = l \mid x ; \theta ) = \frac{ e^{ h_{\theta_l}(x) }}{ \sum^L_{l'=1} e^{ h_{\theta_{l'}}(x) }} = \frac{ e^{ h_{\theta_l}(x) } }{z(x;\theta)}$$
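As a quick sanity check, these softmax probabilities are easy to compute directly. A minimal sketch (the function name is my own; the max-subtraction is the standard log-sum-exp stabilization and does not change the probabilities):

```python
import numpy as np

def softmax_probs(h):
    """Given the vector of confidences h_theta_l(x) for l = 1..L,
    return p(y = l | x; theta) = exp(h_l) / z(x; theta).
    Subtracting max(h) before exponentiating avoids overflow and
    leaves the ratio unchanged."""
    h = np.asarray(h, dtype=float)
    e = np.exp(h - h.max())
    return e / e.sum()
```

Note that shifting every confidence by the same constant leaves the probabilities unchanged, which is a useful property to test against.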

Armed with this model and notation, our goal is to compute the derivative of the cross-entropy loss:

$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta )$$

with respect to any single parameter $t_k$ of $\theta$.

I claim that the derivative is as follows:

$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$

and this is the derivative I'd like to check with the community. This is my derivation:

$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta ) = - \sum^L_{l=1} \left[ \mathbb{1}\{ y = l\} h_{\theta_l}(x) - \mathbb{1}\{ y = l\} \log z(x; \theta) \right] $$

$$ = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \left[ \sum^L_{l=1} \mathbb{1}\{ y = l\} \right] \log z(x; \theta) = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \log z(x; \theta) $$

where $ \sum^L_{l=1} \mathbb{1}\{ y = l\} = 1$ because the label of any valid data point has to match one and only one of the labels $l$. Now taking the derivative we have:

$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial t_k} + \frac{\partial }{\partial t_k }\log z(x; \theta) $$

using the chain rule on $\log z(x; \theta)$ we get:

$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial t_k} + \frac{1}{z(x; \theta)} \sum^L_{l=1} e^{h_{\theta_l}(x)} \frac{ \partial h_{\theta_l}(x)}{\partial t_k} $$

factoring out $\frac{ \partial h_{\theta_l}(x)}{\partial t_k}$ and collecting the summations gives:

$$ -\sum^L_{l=1} \frac{ \partial h_{\theta_l}(x)}{\partial t_k} \left[ \mathbb{1}\{ y = l\} - \frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} \right] $$

but we defined $\frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} = p( y = l \mid x ; \theta ) $

so we get my final answer:

$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$

It's actually quite nice to be able to confirm that this agrees with what Andrew Ng has when the model is just linear, i.e. when $h_{ \theta_l }(x) = {\theta^{(l)}}^T x$. In that case $\frac{\partial h_{\theta_l}(x)}{\partial \theta^{(k)}} = \mathbb{1}\{l = k\}\, x$, so the sum collapses to the single term $l = k$:

$$ \frac{\partial J(x,y; \theta)}{\partial \theta^{(k)}} = - x\, [\mathbb{1}\{ y=k\} - p(y=k \mid x; \theta ) ]$$

The only reason I somewhat doubt my derivation is that I cannot get code based on it to match my numerical derivatives. I asked a question addressing my code on Stack Overflow here.
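For what it's worth, the derived gradient can be checked against central finite differences in the linear case $h_{\theta_l}(x) = {\theta^{(l)}}^T x$. Below is a minimal sketch (the function names and the random test point are my own, not from any particular codebase) comparing the analytic gradient $-x\,[\mathbb{1}\{y=k\} - p(y=k \mid x;\theta)]$ against the numerical one:

```python
import numpy as np

def softmax_probs(h):
    # stable softmax: p_l = exp(h_l) / z
    e = np.exp(h - h.max())
    return e / e.sum()

def loss(theta, x, y):
    # theta: (L, d) matrix, one row theta^{(l)} per label; h_l = theta^{(l)}^T x
    p = softmax_probs(theta @ x)
    return -np.log(p[y])          # cross entropy with one-hot indicator

def grad(theta, x, y):
    # analytic gradient: dJ/dtheta^{(k)} = -x [1{y=k} - p_k]
    p = softmax_probs(theta @ x)
    ind = np.zeros_like(p)
    ind[y] = 1.0
    return -np.outer(ind - p, x)  # row k is the gradient w.r.t. theta^{(k)}

def numeric_grad(theta, x, y, eps=1e-6):
    # central differences, one parameter t_k at a time
    g = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (loss(tp, x, y) - loss(tm, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))   # L = 4 labels, d = 3 features
x = rng.normal(size=3)
y = 2
assert np.allclose(grad(theta, x, y), numeric_grad(theta, x, y), atol=1e-6)
```

If code like this disagrees with a numerical gradient, a common culprit is a dropped sign or perturbing more than one parameter $t_k$ at a time in the finite-difference loop.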