I asked this question on Data Science Stack Exchange, but didn't get any responses there.
I have a (finite) vocabulary which is a metric space, where the metric measures how antonymous the words are. The metric is actually a tree distance, and tree metrics embed isometrically into $\ell_1$, so this is the same as starting with an embedding into an $\ell_1$ space.
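To make the tree-to-$\ell_1$ step concrete, here is a small sketch (the tree, node names, and `tree_to_l1` are all illustrative, not part of my actual data): give each edge its own coordinate, and map a node to the vector of edge weights along its root path. Then $\ell_1$ distance between two nodes recovers the tree distance exactly.

```python
def tree_to_l1(parent, weight):
    """parent[v] = parent of node v (the root maps to itself);
    weight[v] = weight of the edge (v, parent[v]).
    Returns phi: node -> sparse l1 vector (dict keyed by edge,
    identified with its child endpoint)."""
    phi = {}
    for v in parent:
        vec, u = {}, v
        while parent[u] != u:      # walk up to the root
            vec[u] = weight[u]     # coordinate for edge (u, parent[u])
            u = parent[u]
        phi[v] = vec
    return phi

def l1(x, y):
    """l1 distance between two sparse vectors."""
    return sum(abs(x.get(k, 0) - y.get(k, 0)) for k in set(x) | set(y))

# Toy tree: root--a (weight 1), a--b (weight 2), root--c (weight 4)
parent = {"root": "root", "a": "root", "b": "a", "c": "root"}
weight = {"root": 0, "a": 1, "b": 2, "c": 4}
phi = tree_to_l1(parent, weight)
# Tree distance from b to c is 2 + 1 + 4 = 7, and so is the l1 distance:
assert l1(phi["b"], phi["c"]) == 7
```

The coordinates are indexed by edges, so the embedding dimension is the number of edges in the tree; for two nodes, only the edges on their (symmetric-difference) paths contribute, which is exactly the path between them.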
I also have a collection of documents in which word order does not matter, and word repetition may or may not be significant in establishing a meaningful notion of inter-document distance. The goal is essentially to cluster the documents, or to find an embedding and metric which could be used to do so.
The embedding of words into $\ell_1$ gives at least two embeddings of documents into $\ell_1$: If $V$ is my vocabulary and $\phi:V\to \ell_1$ is an embedding, I can take a document $d$ to $\sum_{w\in d}\phi(w)$, or $\sum_{w\in V}\#(w,d) \cdot \phi(w)$, where $\#(w,d)$ is the number of times the word $w$ appears in document $d$. The first sum can be re-written as $$\sum_{w\in V}1_d(w)\phi(w)=\sum_{w\in V}\min\{1,\#(w,d)\}\cdot \phi(w),$$ where $1_d(w)=1$ if $w\in d$ and $0$ otherwise. So we have indicators for whether or not $w$ is in $d$, and word counts.
Thus we think of the vocabulary as simply the length-$1$ documents and extend $\phi$ in the naive way. This is, in fact, just the extension of a map from a basis of a free vector space into $\ell_1$ to the entire free vector space, but that doesn't seem relevant.
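The two document embeddings above (indicator weights $\min\{1,\#(w,d)\}$ versus raw counts $\#(w,d)$) can be sketched as follows; the toy vocabulary embedding `phi` here is made up for illustration:

```python
from collections import Counter

def embed_doc(doc, phi, use_counts):
    """Embed a bag-of-words document as a sum of word vectors.
    phi: word -> tuple of l1 coordinates.
    use_counts=True  weights phi(w) by #(w, d);
    use_counts=False weights phi(w) by min(1, #(w, d))."""
    counts = Counter(doc)
    dim = len(next(iter(phi.values())))
    vec = [0.0] * dim
    for w, c in counts.items():
        coeff = c if use_counts else min(1, c)
        vec = [v + coeff * x for v, x in zip(vec, phi[w])]
    return vec

# Hypothetical word embedding into l1 (coordinates chosen arbitrarily):
phi = {"hot": (1.0, 0.0), "cold": (-1.0, 0.0), "wet": (0.0, 1.0)}
doc = ["hot", "hot", "wet"]

embed_doc(doc, phi, use_counts=False)  # indicator: phi("hot") + phi("wet")
embed_doc(doc, phi, use_counts=True)   # counts: 2*phi("hot") + phi("wet")
```

Documents are then compared by $\ell_1$ distance between these vectors; whether repetition should matter is exactly the choice between the two weightings.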
However, my approach seems somewhat naive. Are there existing, standard approaches to this kind of problem? Are there ways to improve the embedding of the set of documents?