I'm particularly referring to Lecture 2 of Stanford's CS224n: Natural Language Processing with Deep Learning.
Professor Chris Manning writes this equation on the board (related to softmax) and says it's an application of the chain rule.
$$ \frac{\partial}{\partial v_c} \log (\sum_{w=1}^v \exp (u_w^Tv_c)) = \frac{1}{\sum_{w=1}^v \exp (u_w^Tv_c)}*\frac{\partial}{\partial v_c}\sum_{x=1}^v \exp (u_x^Tv_c)$$
I understand this: since the outer function is a $\log$, the chain rule gives $\frac{f'(x)}{f(x)}$, i.e. $\frac{1}{f(x)}$ multiplied by the derivative of the inner function.
My question is: how is the numerator of the first factor on the right-hand side equal to $1$?
$$ \frac{1}{\sum_{w=1}^v \exp (u_w^Tv_c)}$$ That is, how does differentiating
$$\sum_{w=1}^v \exp (u_w^Tv_c)$$
leave that numerator as $1$?
Thank you! :)
I believe that he hasn't yet taken the "second" chain-rule derivative (i.e. the derivative of $\sum_{w=1}^v \exp (u_w^Tv_c)$) at this step. As you mention, the "outer" derivative corresponding to the $\log$ gives: \begin{align*} \dfrac{\partial}{\partial v_c} \log(f(v_c)) = \dfrac{\dfrac{\partial}{\partial v_c} f(v_c)}{f(v_c)} = \dfrac{1}{f(v_c)}\cdot \dfrac{\partial}{\partial v_c} f(v_c). \end{align*} Now substitute $f(v_c)=\sum_{w=1}^{v} \exp(u_w^Tv_c)$ into the above expression and you arrive at the step you've described. The numerator is $1$ simply because the inner derivative has been factored out to the right and not yet evaluated.
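For completeness (and going one step beyond the board equation in question), carrying out the remaining "inner" derivative is a straightforward sketch, using the fact that $\frac{\partial}{\partial v_c} u_x^T v_c = u_x$:

\begin{align*}
\frac{\partial}{\partial v_c}\sum_{x=1}^{v}\exp(u_x^T v_c)
&= \sum_{x=1}^{v}\exp(u_x^T v_c)\,u_x,
\end{align*}

so that, combining with the outer derivative,

\begin{align*}
\frac{\partial}{\partial v_c}\log\sum_{w=1}^{v}\exp(u_w^T v_c)
&= \sum_{x=1}^{v}\frac{\exp(u_x^T v_c)}{\sum_{w=1}^{v}\exp(u_w^T v_c)}\,u_x .
\end{align*}

Each fraction in the final sum is exactly the softmax probability of word $x$ given the center word, so the gradient is an expectation of the output vectors $u_x$ under the model's distribution.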