What does a predicted distribution look like for a word2vec naive-softmax loss?


I'm having trouble visualizing what a word2vec predicted distribution actually is. This is my best attempt at walking through the meaning of the distribution with an example.

The conditional probability of a single (center word, context word) pair is expressed like this:

$$ P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w\in \text{Vocab}}\exp(u_w^\top v_c)} $$

where $ v_c $ is the vector representing the center word, $ u_o $ is the vector representing the outside (context) word, and Vocab is the entire vocabulary (which, in this toy example, is just the three words of the sentence). For example, given the sentence "my dog fido", the center word "dog", a window size of 1, and the vectors

$$ \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ \end{bmatrix} $$

representing the context vector of "my", the center vector of "dog", and the context vector of "fido", respectively (and taking $u_{dog} = v_{dog}$ in the denominator), the dot products are 2, 2, and 1, so $ P(my \mid dog) $ would be $\frac{\exp(2)}{\exp(2) + \exp(2) + \exp(1)} \approx 0.4223$. (Note the exponentials must be applied before summing, so this is not simply $\frac{2}{2+2+1}$.)
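To make the arithmetic concrete, here is a small NumPy sketch of that computation. (The vectors are the toy values above; treating "dog"'s outside vector as equal to its center vector is my own simplifying assumption, since the example only defines one vector per word.)

```python
import numpy as np

# Toy vectors from the example above.
u_my   = np.array([1.0, 1.0, 1.0])  # outside vector for "my"
v_dog  = np.array([1.0, 0.0, 1.0])  # center vector for "dog"
u_fido = np.array([1.0, 1.0, 0.0])  # outside vector for "fido"
u_dog  = v_dog  # assumption: "dog"'s outside vector equals its center vector

U = np.stack([u_my, u_dog, u_fido])        # rows correspond to "my", "dog", "fido"
scores = U @ v_dog                         # dot products: [2., 2., 1.]
p = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

print(p[0])  # P(my | dog) ≈ 0.4223
```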

The naive-softmax loss is expressed like this: $$ J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) $$

So $J_{\text{naive-softmax}}(v_{dog}, my, U) = -\log P(my \mid dog)$; my understanding is that the naive-softmax loss here is a single number. Equation (3) in the document, however, implies that $J_{\text{naive-softmax}}$ should involve a predicted distribution $\hat{y}$ that can be compared to an empirical distribution represented by a one-hot vector $y$. What am I missing here? Is $\hat{y}$ supposed to be a vector containing three $J_{\text{naive-softmax}}(v_c, o, U)$ calculations (one for each word in the sentence)?
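To make the comparison I am asking about concrete, here is how I would compute a full softmax vector $\hat{y}$ over the toy vocabulary and the cross-entropy against a one-hot $y$. (As before, using "dog"'s center vector as its outside vector is my own assumption, and I am using the natural log.)

```python
import numpy as np

# Same toy vectors as above; rows of U are the outside vectors
# for "my", "dog", "fido" (u_dog = v_dog is my simplifying assumption).
v_dog = np.array([1.0, 0.0, 1.0])
U = np.array([[1.0, 1.0, 1.0],   # u_my
              [1.0, 0.0, 1.0],   # u_dog
              [1.0, 1.0, 0.0]])  # u_fido

scores = U @ v_dog
y_hat = np.exp(scores) / np.exp(scores).sum()  # predicted distribution over the vocab
y = np.array([1.0, 0.0, 0.0])                  # one-hot: true outside word is "my"

# Cross-entropy between y and y_hat reduces to -log(y_hat) at the true word.
loss = -np.sum(y * np.log(y_hat))

print(loss)  # ≈ 0.862, i.e. -log P(my | dog)
```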