Set notation for conditional probabilities


Consider this definition for the probability that a given document $d$ contains a term $t$ as the relative document frequency:

$$P(t|D)={\frac {|\{d\in D:t\in d\}|}{N}}$$

What does $:$ mean in the numerator here, i.e. how do you read that expression? Is it "and"? If so, is it equivalent to, e.g., a comma?


In set builder notation the colon is often read as "such that" or "where".

So, as explained on the wiki page, $\lvert\{d\in D: t\in d\}\rvert$ is "the size of the set of documents ($d$) in the corpus ($D$) where the term ($t$) occurs in the document".

  • The placeholder, $d$, is a local variable standing for any element of the set we are constructing.
  • The domain selection, $d\in D$, occurring before the colon (though it may be placed after), indicates that we are constructing a set from the elements of $D$. Here the corpus, $D$, is a collection of documents.
  • The predicate, $t\in d$, occurring after the colon, indicates that we keep only those elements that have the property of containing $t$. Each document is a collection of terms.
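For readers who think in code, the three parts above map closely onto a Python set comprehension. This is only a sketch: the toy corpus and the names `D`, `t`, and `matching` are made up for illustration, and each document is modelled as a `frozenset` of terms so that it can itself be an element of a set.

```python
# Hypothetical toy corpus: each document is a frozenset of terms.
D = {
    frozenset({"cat", "sat", "mat"}),
    frozenset({"dog", "sat"}),
    frozenset({"cat", "dog"}),
    frozenset({"bird"}),
}
t = "cat"

# {d ∈ D : t ∈ d}
#   placeholder      -> the loop variable d
#   domain selection -> "for d in D"
#   predicate        -> "if t in d" (the part after the colon)
matching = {d for d in D if t in d}

# |{d ∈ D : t ∈ d}| is the number of documents that contain t.
count = len(matching)
print(count)
```

Read aloud, the comprehension says "the set of `d`, for `d` in `D`, such that `t` is in `d`", which is the same reading as the colon.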

Likewise, $N$ is the size of the corpus: $N=\lvert D\rvert=\lvert\{d\in D\}\rvert$.

So $\mathsf P(t\mid D)$ is an abbreviation for $\mathsf P(\{d\in D:t\in d\}\mid D)$, and since the set is a subset of the corpus, $$\begin{align}\mathsf P(t\mid D)~&=~\dfrac{\lvert\{d\in D:t\in d\}\cap D \rvert}{\lvert D\rvert}\\&=~\dfrac{\lvert\{d\in D:t\in d\}\rvert}{N}\end{align}$$
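The ratio above can be checked numerically. The following is a minimal sketch under the same modelling assumption (documents as `frozenset`s of terms); the function name `relative_document_frequency` and the sample corpus are hypothetical, and `Fraction` is used only to keep the result exact.

```python
from fractions import Fraction


def relative_document_frequency(term, corpus):
    """Return |{d in corpus : term in d}| / |corpus| as an exact fraction."""
    matching = {d for d in corpus if term in d}
    # The matching set is a subset of the corpus, so intersecting it
    # with the corpus changes nothing.
    assert matching & corpus == matching
    return Fraction(len(matching), len(corpus))


# Hypothetical corpus with N = 4 documents.
corpus = {
    frozenset({"cat", "sat", "mat"}),
    frozenset({"dog", "sat"}),
    frozenset({"cat", "dog"}),
    frozenset({"bird"}),
}

p = relative_document_frequency("sat", corpus)
print(p)  # two of the four documents contain "sat"
```

Note that the sketch assumes a non-empty corpus; with $N=0$ the division is undefined and `Fraction` raises `ZeroDivisionError`.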