Quality of Embedding Subject to Expected Cosine Distance


Full disclosure, I posted a related question on Cross Validated. Here I want to focus on a different aspect though.


I want to measure the quality of word embeddings while taking into account the expected value of the quality measure I chose. Let $V$ denote the set of vectors of a given embedding.

Say I choose the cosine distance between two vectors $\mathbf{v}, \mathbf{w} \in V$, which is defined as

$$\operatorname{cosdist}(\mathbf{v}, \mathbf{w}) = 1 - \frac{\langle \mathbf{v}, \mathbf{w} \rangle}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{w} \rVert}$$
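To make the definition concrete, here is a minimal sketch of it in plain Python. The function name `cosdist` simply mirrors the formula above, and vectors are assumed to be non-zero lists of floats; this is an illustration, not a tuned implementation.

```python
import math


def cosdist(v, w):
    """Cosine distance 1 - <v, w> / (||v|| ||w||) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(v, w))
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_w = math.sqrt(sum(y * y for y in w))
    return 1.0 - dot / (norm_v * norm_w)


same = cosdist([1.0, 0.0], [1.0, 0.0])       # same direction  -> 0.0
orthogonal = cosdist([1.0, 0.0], [0.0, 1.0])  # orthogonal      -> 1.0
opposite = cosdist([1.0, 0.0], [-1.0, 0.0])   # opposite        -> 2.0
print(same, orthogonal, opposite)
```

The distance therefore ranges over $[0, 2]$, which matters when comparing an average against a baseline later on.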

With the cosine distance I would evaluate $n$ tuples $(a, b, c, d)$ of the form

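| $a$          | $b$          | $c$        | $d$      |
|--------------|--------------|------------|----------|
| Philadelphia | Pennsylvania | Louisville | Kentucky |
| implement    | implementing | fly        | flying   |
| efficient    | efficiently  | happy      | happily  |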

where $a$ is to $b$ as $c$ is to $d$, by computing the average cosine distance between $\mathbf{v}_i = \mathbf{c}_i + (\mathbf{b}_i - \mathbf{a}_i)$ and $\mathbf{d}_i$ over all $n$ tuples:

$$\frac{1}{n}\sum_{i=1}^n \operatorname{cosdist}(\mathbf{v}_i, \mathbf{d}_i)$$
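The averaging step can be sketched as follows. The helper name `mean_analogy_distance` and the 2-dimensional toy vectors are hypothetical, chosen only so that the analogy holds exactly; a real embedding would supply its own vectors.

```python
import math


def cosdist(v, w):
    """Cosine distance 1 - <v, w> / (||v|| ||w||) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(v, w))
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_w = math.sqrt(sum(y * y for y in w))
    return 1.0 - dot / (norm_v * norm_w)


def mean_analogy_distance(embedding, analogies):
    """Average cosdist(c + (b - a), d) over all (a, b, c, d) word tuples."""
    total = 0.0
    for a, b, c, d in analogies:
        va, vb, vc, vd = (embedding[word] for word in (a, b, c, d))
        # v = c + (b - a), computed component-wise
        v = [ci + (bi - ai) for ai, bi, ci in zip(va, vb, vc)]
        total += cosdist(v, vd)
    return total / len(analogies)


# Hypothetical toy vectors (not taken from any real embedding).
toy = {
    "implement": [1.0, 0.0],
    "implementing": [1.0, 1.0],
    "fly": [2.0, 0.0],
    "flying": [2.0, 1.0],
}
score = mean_analogy_distance(toy, [("implement", "implementing", "fly", "flying")])
print(score)  # approximately 0, because the offset b - a maps c exactly onto d
```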

$\mathbf{d}$ is essentially the "target" vector for $\mathbf{v}$: the smaller the average distance to the target vectors, the better the embedding.

However, that average does not take into account the expected distance, i.e. the average cosine distance between any two vectors in the embedding. (For random sets of vectors, that value tends towards $0.25$ as the number of dimensions or the number of vectors grows.)
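A rough sketch of how such a baseline could be estimated is given below. Note that the limiting value depends on how the random vectors are drawn, which is not specified here. As an assumption for illustration only, the sketch draws components independently and uniformly from $[0, 1]$, for which the mean pairwise distance comes out close to $0.25$; for zero-mean Gaussian components it comes out close to $1$ instead. The helper name `expected_cosdist` is hypothetical.

```python
import itertools
import math
import random


def cosdist(v, w):
    """Cosine distance 1 - <v, w> / (||v|| ||w||) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(v, w))
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_w = math.sqrt(sum(y * y for y in w))
    return 1.0 - dot / (norm_v * norm_w)


def expected_cosdist(vectors):
    """Mean cosine distance over all unordered pairs of distinct vectors."""
    dists = [cosdist(v, w) for v, w in itertools.combinations(vectors, 2)]
    return sum(dists) / len(dists)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim, count = 100, 100

# Assumption: components i.i.d. uniform on [0, 1] (all vectors in one orthant).
uniform = [[rng.random() for _ in range(dim)] for _ in range(count)]
# Contrast: components i.i.d. standard normal (directions symmetric about 0).
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

u_mean = expected_cosdist(uniform)
g_mean = expected_cosdist(gaussian)
print(u_mean)  # roughly 0.25
print(g_mean)  # roughly 1.0
```

For an actual embedding, the same function could be applied to $V$ itself (or to a random sample of it) rather than to synthetic vectors.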

My question is: how do I correctly relate the average cosine distance between $\mathbf{v}$ and $\mathbf{d}$ to the expected cosine distance between any $\mathbf{w}, \mathbf{u} \in V$?

Is

$$\frac{\text{expected distance} - \text{average distance of } \mathbf{v} \text{ and } \mathbf{d}}{\text{expected distance}}$$

a correct measure? If not, how can I determine what a correct or sensible measure would be?
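For reference, the candidate measure is easy to compute once both quantities are known. The sketch below uses a hypothetical function name, `relative_improvement`, and made-up input values; it only illustrates how the ratio behaves and makes no claim that it is the right measure.

```python
def relative_improvement(expected_distance, average_distance):
    """(expected - average) / expected, the candidate measure from the question."""
    if expected_distance == 0:
        raise ValueError("expected_distance must be non-zero")
    return (expected_distance - average_distance) / expected_distance


# Made-up values, with 0.25 standing in for the expected distance.
perfect = relative_improvement(0.25, 0.0)    # 1.0: every v_i points exactly at d_i
baseline = relative_improvement(0.25, 0.25)  # 0.0: no better than a random pair
worse = relative_improvement(0.25, 0.5)      # -1.0: worse than a random pair
print(perfect, baseline, worse)
```

So the ratio is $1$ for a perfect embedding, $0$ at the baseline, and negative whenever the analogy distance exceeds the expected distance.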