Confusion about using the chain rule for the total derivative of the NLL loss


My question concerns finding the total derivative of the NLL loss function $L$ with respect to $w_i$.

The "pipeline" is often expressed as: $$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_i}$$

where $L$ is the NLL loss function, $w_i$ is the context word-embedding vector, and $z$ is not the softmax function itself but the *input* to the softmax: the dot product between the context word embedding $w_i$ and the target word embedding $t$.

Here is where my confusion comes in. From a more classical viewpoint, I see the NLL loss function $L$ as nested composite functions:

$$L(s(z(w_i,t))) = -\log(s(z(w_i,t)))$$

So if I were asked to take the total derivative of $L$ with respect to $w_i$, I would instinctively apply the chain rule through all of the composite functions: $$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial s(z)} \cdot \frac{\partial s(z)}{\partial z(w_i,t)} \cdot \frac{\partial z(w_i,t)}{\partial w_i}$$

But no one seems to do it this way. They always seem to "skip" differentiating the softmax function $s(z)$ and go straight to taking the derivative of $L$ with respect to $z$. My only thought on why this could be is that they are not treating the softmax as a separate function: its "innards" are substituted directly into the NLL loss rather than kept as a variable function $s(z)$, like so: $$L(z(w_i,t)) = -\log\left(\frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}\right)$$
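For what it's worth, the two routes should agree numerically, since the "skip" is just the full chain-rule product with the softmax Jacobian multiplied out. Below is a minimal NumPy sketch checking this; the vocabulary size, embedding dimension, and target index are made-up toy values, not from any particular text:

```python
import numpy as np

# Hypothetical toy setup (sizes and index are illustrative only).
rng = np.random.default_rng(0)
N, d = 5, 3                       # vocabulary size, embedding dimension
W = rng.normal(size=(N, d))       # context word embeddings w_j
t = rng.normal(size=d)            # target word embedding
i = 2                             # index of the observed (correct) word

z = W @ t                         # logits: z_j = w_j . t
s = np.exp(z) / np.exp(z).sum()   # softmax s(z)
# L = -log(s_i)  (NLL loss for the observed word i)

# Route 1: full chain rule, dL/dz = (dL/ds) . (ds/dz)
dL_ds = np.zeros(N)
dL_ds[i] = -1.0 / s[i]            # derivative of -log(s_i) w.r.t. s
J = np.diag(s) - np.outer(s, s)   # softmax Jacobian, ds_k/dz_j
dL_dz_chain = dL_ds @ J

# Route 2: the "skipped" combined form, dL/dz = s - onehot(i)
y = np.zeros(N)
y[i] = 1.0
dL_dz_short = s - y

# Both routes give the same gradient of L w.r.t. the logits z.
assert np.allclose(dL_dz_chain, dL_dz_short)
```

Since $z_j = w_j \cdot t$, the last factor $\partial z / \partial w_i$ then just multiplies the $i$-th component of this gradient by $t$, whichever route you take.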