If I've got a function
$$ \log p(\tau | \theta) = \log ( \frac{\exp(\theta^T \tau)}{\sum_\tau \exp(\theta^T\tau)} ) $$
how do I calculate its derivative to maximize the log-likelihood?
$$\log p(\tau | \theta) = \theta^T \tau - \log\left( \sum_\tau \exp (\theta^T \tau) \right) $$ Using the chain rule $$ u = \log(v)$$ $$ \frac{du}{dv} = 1 / v$$
$$ v = \sum_\tau \exp(w) $$ $$ \frac{dv}{dw} = \sum_\tau \exp(w) $$
$$ w = \theta^T \tau $$ $$ \frac{dw}{d\theta} = \tau $$
which leaves me with
$$ \frac{du}{d\theta} = \frac{du}{dv} \cdot \frac{dv}{dw} \cdot \frac{dw}{d\theta} = \frac{1}{\sum_\tau \exp(\theta^T \tau)} \cdot \sum_\tau \exp(\theta^T \tau) \cdot \tau = \tau$$
According to the answer sheet this is wrong, but I'm not entirely sure why. Can someone point out the mistake?
Thanks
You write $\sum_\tau \exp(\theta^\top \tau)$, which implies that $\tau $ is just a dummy variable ranging over some set. Your final answer cannot involve $\tau$.
Your chain rule is correct right up until $$ \frac{\sum_\tau \exp(\theta^\top \tau)\tau}{\sum_\tau \exp(\theta^\top \tau)}. $$ However, you cannot "pull the $\tau$ out of the top summation", because $\tau$ is the summation index itself and so takes a different value in every term; it is not a constant. Therefore, the above expression is as simple as it gets (without additional information).
Edit: I see now the correct definition of the function you are trying to differentiate is $$ \theta^\top\tau-\log \sum_{\tau} \exp(\theta^\top \tau). $$ Notice that $\tau$ is playing two roles here; one as a summation index, and the other as a fixed vector. This can cause confusion. To make things clearer, I will use $\sigma$ for the summation index: $$ \theta^\top\tau-\log \sum_{\sigma} \exp(\theta^\top \sigma) $$ Using my previous explanation, the simplest form of the derivative of this is $$ \tau-\frac{\sum_\sigma \exp(\theta^\top \sigma)\sigma}{\sum_\sigma \exp(\theta^\top \sigma)}. $$
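If you want to convince yourself numerically, here is a minimal sketch that compares the formula above with a central finite-difference gradient. The set that $\sigma$ ranges over (`support`), the value of `theta`, and all function names are made up for illustration; they are assumptions, not part of your problem.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def log_p(theta, tau, support):
    """theta^T tau - log sum_sigma exp(theta^T sigma)."""
    log_z = math.log(sum(math.exp(dot(theta, s)) for s in support))
    return dot(theta, tau) - log_z

def grad_log_p(theta, tau, support):
    """tau - (sum_sigma exp(theta^T sigma) sigma) / (sum_sigma exp(theta^T sigma))."""
    weights = [math.exp(dot(theta, s)) for s in support]
    z = sum(weights)
    weighted_mean = [
        sum(w * s[i] for w, s in zip(weights, support)) / z
        for i in range(len(theta))
    ]
    return [t - m for t, m in zip(tau, weighted_mean)]

def numeric_grad(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Hypothetical example data: sigma ranges over three 2-d vectors.
support = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
theta = [0.3, -0.2]
tau = support[0]  # the fixed vector whose log-probability we differentiate

analytic = grad_log_p(theta, tau, support)
numeric = numeric_grad(lambda th: log_p(th, tau, support), theta)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two printed gradients should agree to several decimal places, and neither should equal `tau` itself, which is what you would get if the $\tau$ could be pulled out of the sum.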