What does a predicted distribution look like for a word2vec naive-softmax loss?


I'm having trouble visualizing what a word2vec predicted distribution actually is. This is my best attempt at walking through the meaning of the distribution with an example.

The conditional probability of a single (center word, context word) pair is expressed like this:

$$ P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w\in \text{Vocab}}\exp(u_w^\top v_c)} $$

where $ v_c $ is the vector representing the center word, $ u_o $ is the vector representing the outside (context) word, and Vocab is the entire vocabulary (which, in this toy example, is just the three words of the sentence). For example, given the sentence "my dog fido", the center word "dog", a window size of 1, and the vectors

$$ \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ \end{bmatrix} $$

representing the context vector of "my", the center vector of "dog", and the context vector of "fido", respectively (and taking $u_{dog} = v_{dog}$ in the denominator), the dot products are 2, 2, and 1, so $ P(my \mid dog) $ would be $\frac{\exp(2)}{\exp(2) + \exp(2) + \exp(1)} \approx 0.4223$. (Note the exponentials must be applied before summing, so this is not simply $\frac{2}{2+2+1}$.)
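To make the arithmetic concrete, here is a small NumPy sketch of that computation. (The vectors are the toy values above; treating "dog"'s outside vector as equal to its center vector is my own simplifying assumption, since the example only defines one vector per word.)

```python
import numpy as np

# Toy vectors from the example above.
u_my   = np.array([1.0, 1.0, 1.0])  # outside vector for "my"
v_dog  = np.array([1.0, 0.0, 1.0])  # center vector for "dog"
u_fido = np.array([1.0, 1.0, 0.0])  # outside vector for "fido"
u_dog  = v_dog  # assumption: "dog"'s outside vector equals its center vector

U = np.stack([u_my, u_dog, u_fido])        # rows correspond to "my", "dog", "fido"
scores = U @ v_dog                         # dot products: [2., 2., 1.]
p = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

print(p[0])  # P(my | dog) ≈ 0.4223
```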

The naive-softmax loss is expressed like this: $$ J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) $$

So $J_{\text{naive-softmax}}(v_{dog}, my, U) = -\log P(my \mid dog)$; my understanding is that the naive-softmax loss here is a single number. Equation (3) in the document, however, implies that $J_{\text{naive-softmax}}$ should involve a predicted distribution $\hat{y}$ that can be compared to an empirical distribution represented by a one-hot vector $y$. What am I missing here? Is $\hat{y}$ supposed to be a vector containing three $J_{\text{naive-softmax}}(v_c, o, U)$ calculations (one for each word in the sentence)?
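To make the comparison I am asking about concrete, here is how I would compute a full softmax vector $\hat{y}$ over the toy vocabulary and the cross-entropy against a one-hot $y$. (As before, using "dog"'s center vector as its outside vector is my own assumption, and I am using the natural log.)

```python
import numpy as np

# Same toy vectors as above; rows of U are the outside vectors
# for "my", "dog", "fido" (u_dog = v_dog is my simplifying assumption).
v_dog = np.array([1.0, 0.0, 1.0])
U = np.array([[1.0, 1.0, 1.0],   # u_my
              [1.0, 0.0, 1.0],   # u_dog
              [1.0, 1.0, 0.0]])  # u_fido

scores = U @ v_dog
y_hat = np.exp(scores) / np.exp(scores).sum()  # predicted distribution over the vocab
y = np.array([1.0, 0.0, 0.0])                  # one-hot: true outside word is "my"

# Cross-entropy between y and y_hat reduces to -log(y_hat) at the true word.
loss = -np.sum(y * np.log(y_hat))

print(loss)  # ≈ 0.862, i.e. -log P(my | dog)
```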