Unsure if I computed partial derivative of Loss function correctly


Merry Christmas and happy holidays, everyone! A quick preface: this question is about computing the gradient (derivative) of the loss with respect to specific word embeddings in a Word2Vec model. I don't go into specifics on what the variables in my loss function mean because I don't think it's needed to answer the question, but if you feel the context would help, I'd gladly expand on it!

QUESTION: So I have a loss function, $L$, that looks like:

$$L(z(w,t)) = -\log(\frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}})$$

where each score $z_j$ is the dot product of the column vector $w_j$ with the column vector $t$:

$$z_j = w_j \cdot t$$

I want to take the derivative of $L$ w.r.t. $w_i$ and $t$, individually. The numerator involves only $z_i = w_i \cdot t$, while the denominator sums $e^{z_j}$ over the dot products of every $w_j$ with $t$. This means $w_i$ appears exactly once in the denominator, in the term where $j = i$.

My computed gradient for $\frac{\partial L}{\partial w_i}$ looks like so:

*(image: my computed gradient of $L$ w.r.t. $w_i$)*

where $p_i$ is the probability that the SoftMax function assigns to word $i$ (its output for the score $z_i$), used here for simplicity:

$$\text{SoftMax} = p_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$
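To make the notation concrete, here is a small numerical sketch of the setup (the names `W`, `t`, and the sizes $N = 4$, $d = 3$ are just my own illustration, not part of the model):

```python
import numpy as np

# Hypothetical small example: N = 4 words, embedding dimension d = 3.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # row j is the vector w_j
t = rng.normal(size=3)        # the vector t

z = W @ t                          # z_j = w_j . t for every j
p = np.exp(z) / np.exp(z).sum()    # SoftMax probabilities p_j

i = 2                              # index of the "correct" word
L = -np.log(p[i])                  # the loss L = -log(p_i)
```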

I'm skipping my work for that derivation so that I can show my work for $\frac{\partial L}{\partial t}$ and check whether I did it correctly. That one was more involved, so I'm less sure of it. I used MS Paint and MathJax to make it more readable:

*(images, pages 1–5: my handwritten derivation of the gradient of $L$ w.r.t. $t$)*

Does this look correct, or am I doing something extremely wrong here? I am using the naive-softmax approach, and I haven't found many resources online to cross-check against that look sufficiently like mine.
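For what it's worth, the standard naive-softmax results I've been trying to match are $\frac{\partial L}{\partial w_i} = (p_i - 1)\,t$ and $\frac{\partial L}{\partial t} = \sum_j p_j w_j - w_i$. A finite-difference check like the sketch below (again assuming the $w_j$ are rows of an $N \times d$ matrix; all variable names are my own) agrees with those on random data, so my derivation should reduce to the same expressions:

```python
import numpy as np

def loss(W, t, i):
    """Naive-softmax loss L = -log(p_i) with z_j = w_j . t."""
    z = W @ t
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[i])

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
t = rng.normal(size=3)
i = 1

z = W @ t
p = np.exp(z) / np.exp(z).sum()

# Candidate analytic gradients (standard naive-softmax results):
grad_wi = (p[i] - 1.0) * t      # dL/dw_i = (p_i - 1) t
grad_t = p @ W - W[i]           # dL/dt   = sum_j p_j w_j - w_i

# Central finite-difference approximations of both gradients.
eps = 1e-6
num_t = np.zeros_like(t)
num_wi = np.zeros_like(t)
for k in range(t.size):
    dt = np.zeros_like(t)
    dt[k] = eps
    num_t[k] = (loss(W, t + dt, i) - loss(W, t - dt, i)) / (2 * eps)

    dW = np.zeros_like(W)
    dW[i, k] = eps
    num_wi[k] = (loss(W + dW, t, i) - loss(W - dW, t, i)) / (2 * eps)
```

If the analytic and numeric gradients agree (e.g. via `np.allclose`), whatever I derived by hand should simplify to those two expressions.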