If I build a word count on a corpus of text, are the atoms of this distribution the unique words themselves?
I ask because the definitions I read of atoms in a probability distribution (such as here) are too abstract.
It's the excellent answer to this StackExchange question that prompts me to ask. It gives an equation "where |x| is the number of atoms in the discrete distribution". Interpreting atoms as the set of words in the corpus is certainly consistent with the context but I can't find anywhere that says this definitively.
Similarly, my interpretation of atoms as the possible outcomes of any probability distribution is supported by this StackExchange question, but I am still not sure.
Can anybody please confirm this?
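The atoms of a measure space are, in some sense, the smallest non-null sets in the space. A measurable set is an atom if it has positive measure and every measurable subset of it has either zero measure or the same measure as the atom itself. Equivalently, a set is not an atom if it contains a measurable subset of strictly smaller, non-zero measure.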
For counting the words in a corpus, you are correct that the distinct words are the atoms: although a word may contain another word as a substring (for example, "the" appears inside "other"), it does not contain another distinct word (one delimited by non-word characters) for the purposes of word count$[1]$. In such a corpus, words are atoms; phrases and sentences are built out of atoms, paragraphs out of phrases and sentences, and so on.
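To make this concrete, here is a minimal sketch, assuming the simplest possible tokenisation (lower-casing and splitting on whitespace). The function name `empirical_distribution` and the sample sentence are illustrative choices, not something taken from the question. The keys of the resulting dictionary play the role of the atoms, and each carries non-zero probability.

```python
from collections import Counter


def empirical_distribution(text):
    """Map each distinct word to its relative frequency in the text.

    Assumption: words are whatever str.split() yields after lower-casing.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


dist = empirical_distribution("the cat saw the other cat")

# The atoms are the distinct words; "the" inside "other" is not counted.
atoms = sorted(dist)
print(atoms)       # ['cat', 'other', 'saw', 'the']
print(len(atoms))  # number of atoms in the discrete distribution: 4
```

Each atom's probability is its count divided by the total number of tokens, so the probabilities sum to one.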
You don't say, but I suspect, that you're planning on using a Bayesian classifier, in which case words are definitely the right atom to select.
$[1]$ Language is tricky: a hyphenated word can certainly be claimed to contain at least two distinct words, and some languages use symbols as "letters" (e.g. representing a glottal stop as an apostrophe) that might be thought to be non-word characters in other languages. But for word count, you can almost certainly treat a distinct word as: being at the start of a line and followed by a punctuation mark or whitespace; being both preceded and followed by punctuation marks and/or whitespace; or being at the end of a line and preceded by punctuation or whitespace. You will have to decide whether to special-case the hyphen, since the above rules treat a hyphenated word as two (or more) words. You might also want to note that English has a couple of special-case words, like stun's'l and fo'c'sle, that will each be turned into three words.
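The delimiter rules in the footnote can be approximated with a regular expression. The following is a rough sketch rather than a definitive tokeniser: the function `tokenize` and its `keep_hyphens` and `keep_apostrophes` flags are hypothetical names introduced here, and a "word" is assumed to be a maximal run of Unicode letters, optionally joined by the chosen punctuation.

```python
import re

# Any Unicode letter: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"


def tokenize(text, keep_hyphens=False, keep_apostrophes=False):
    """Split text into words delimited by punctuation or whitespace.

    By default hyphens and apostrophes act as delimiters; the flags
    (an assumption of this sketch) allow them inside a word instead.
    """
    joiners = ""
    if keep_hyphens:
        joiners += "-"
    if keep_apostrophes:
        joiners += "'"
    if joiners:
        # Letters, then zero or more (joiner + letters) groups.
        pattern = f"{LETTER}+(?:[{joiners}]{LETTER}+)*"
    else:
        pattern = f"{LETTER}+"
    return re.findall(pattern, text)


sample = "The fo'c'sle is well-lit."
print(tokenize(sample))
# ['The', 'fo', 'c', 'sle', 'is', 'well', 'lit']
print(tokenize(sample, keep_hyphens=True, keep_apostrophes=True))
# ['The', "fo'c'sle", 'is', 'well-lit']
```

With the defaults, fo'c'sle becomes three words, as the footnote warns; enabling the flags is one possible way to special-case such words, at the cost of also gluing together things like quoted fragments that happen to sit between letters.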