Differentiating logsumexp


If I've got a function

$$ \log p(\tau | \theta) = \log \left( \frac{\exp(\theta^T \tau)}{\sum_\tau \exp(\theta^T\tau)} \right) $$

how do I calculate its derivative to maximize the log-likelihood?

$$\log p(\tau | \theta) = \theta^T \tau - \log\left( \sum_\tau \exp (\theta^T \tau) \right) $$

Using the chain rule:

$$ u = \log(v) $$ $$ \frac{du}{dv} = \frac{1}{v} $$

$$ v = \sum_\tau \exp(w) $$ $$ \frac{dv}{dw} = \sum_\tau \exp(w) $$

$$ w = \theta^T \tau $$ $$ \frac{dw}{d\theta} = \tau $$

which leaves me with

$$ \frac{du}{d\theta} = \frac{du}{dv} \cdot \frac{dv}{dw} \cdot \frac{dw}{d\theta} = \frac{1}{\sum_\tau \exp(\theta^T \tau)} \cdot \sum_\tau \exp(\theta^T \tau) \cdot \tau = \tau$$

According to the answer sheet this is wrong, but I'm not entirely sure why. Can someone point out the mistake?

Thanks


There are 2 best solutions below

0
On BEST ANSWER

You write $\sum_\tau \exp(\theta^\top \tau)$, which implies that $\tau $ is just a dummy variable ranging over some set. Your final answer cannot involve $\tau$.

You are correct right up until $$ \frac{\sum_\tau \exp(\theta^\top \tau)\tau}{\sum_\tau \exp(\theta^\top \tau)}. $$ However, you cannot "pull the $\tau$ out of the top summation" because $\tau$ is not a constant with respect to the summation index, $\tau$. Therefore, the above expression is as simple as it gets (without additional information).
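
To spell out why the $\tau$ has to stay inside the sum, it may help to differentiate the sum term by term rather than through a single intermediate variable:

$$ \nabla_\theta \sum_\tau \exp(\theta^\top \tau) = \sum_\tau \nabla_\theta \exp(\theta^\top \tau) = \sum_\tau \exp(\theta^\top \tau)\,\tau. $$

Each term of the sum has its own $w = \theta^\top \tau$, so a single $\frac{dv}{dw}$ followed by a single $\frac{dw}{d\theta}$ does not describe the dependence on $\theta$. Dividing the result by $\sum_\tau \exp(\theta^\top \tau)$ gives the expression above.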

Edit: I see now the correct definition of the function you are trying to differentiate is $$ \theta^\top\tau-\log \sum_{\tau} \exp(\theta^\top \tau). $$ Notice that $\tau$ is playing two roles here; one as a summation index, and the other as a fixed vector. This can cause confusion. To make things clearer, I will use $\sigma$ for the summation index: $$ \theta^\top\tau-\log \sum_{\sigma} \exp(\theta^\top \sigma) $$ Using my previous explanation, the simplest form of the derivative of this is $$ \tau-\frac{\sum_\sigma \exp(\theta^\top \sigma)\sigma}{\sum_\sigma \exp(\theta^\top \sigma)}. $$
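
If you want to convince yourself numerically, here is a minimal sketch that compares this gradient against central finite differences. The set of candidate vectors, the dimensions, and all function names are assumptions made up for the check, since the question does not say what the sum ranges over.

```python
import numpy as np

def log_likelihood(theta, tau, candidates):
    """theta^T tau - log sum_sigma exp(theta^T sigma), sigma over rows of candidates."""
    scores = candidates @ theta
    m = scores.max()  # shift by the max so the exponentials cannot overflow
    return theta @ tau - (m + np.log(np.exp(scores - m).sum()))

def analytic_gradient(theta, tau, candidates):
    """tau - sum_sigma exp(theta^T sigma) sigma / sum_sigma exp(theta^T sigma)."""
    scores = candidates @ theta
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # normalised weights exp(theta^T sigma) / sum
    return tau - weights @ candidates

def numeric_gradient(f, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Hypothetical data: 5 candidate vectors in R^3, with tau chosen as one of them.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(5, 3))
tau = candidates[2]
theta = rng.normal(size=3)

g_analytic = analytic_gradient(theta, tau, candidates)
g_numeric = numeric_gradient(lambda t: log_likelihood(t, tau, candidates), theta)
print(np.allclose(g_analytic, g_numeric, atol=1e-6))
```

The two gradients should agree up to finite-difference error. The second term is a weighted average of the $\sigma$ vectors, which is why it cannot collapse to a single vector unless the sum has only one term.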

0
On

$$\nabla\left(\log\sum_i\exp(\theta^T\tau_i)\right)=\frac{\sum_i\exp(\theta^T\tau_i)\tau_i}{\sum_i\exp(\theta^T\tau_i)}.$$