How is this paper using probability notation?

I am trying to understand this paper about documents and sentences. At the end of page three, they say:

Let $g(w_i, w_j)$ be the distance between two events (1 if in the same sentence, 2 in neighboring, etc). Let $C_{dist}(w_i, w_j)$ be the distance-weighted frequency of two events occurring together, where $D$ is all documents.

[Image: the equation defining $C_{dist}(w_i, w_j)$]

I understand that part. I see how the cumulative distance of all events in all documents is generated by that equation.

But then they say:

[Image: statements (2)-(4), defining $pmi(w_i, w_j)$, $P_{dist}$ and $P$]

I don't understand how they arrive at this because I don't understand the notation. I understand the concept of pointwise mutual information (PMI), but what does the numerator in statement (2) mean? Is it the probability of the distance between the two events? I don't see what that would mean. Can someone explain statements (2)-(4)?

There is 1 answer below

Lines $(3)$ and $(4)$ provide the definition for the notation in line $(2)$ for $P_{dist}()$ and $P()$. The only function/notation left undefined in these lines is $C(w_i)$ in line $(3)$, but I'm pretty sure this is just the count of $w_i$ in the documents (i.e. the number of times $w_i$ occurs).
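Since the paper's equations are only available as images, here is what lines $(2)$-$(4)$ presumably look like, read back from the substitution that follows. This is a reconstruction rather than a quotation, and the paper may additionally take a logarithm of the ratio, as the standard PMI does:

$$pmi(w_i, w_j) = \dfrac{P_{dist}(w_i, w_j)}{P(w_i) P(w_j)}, \qquad P_{dist}(w_i, w_j) = \dfrac{C_{dist}(w_i, w_j)}{\sum_{k}{\sum_{l}{C_{dist}(w_k, w_l)}}}, \qquad P(w_i) = \dfrac{C(w_i)}{\sum_{k}{C(w_k)}}$$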

To explain what I think is going on here, I'll start by substituting those definitions from $(3)$ and $(4)$ into $(2)$ and then do some re-arranging:

\begin{eqnarray*} pmi(w_i, w_j) &=& \dfrac{C_{dist}(w_i, w_j)}{\sum_{k}{\sum_{l}{C_{dist}(w_k, w_l)}}} \Bigg/ \dfrac{C(w_i) C(w_j)}{\sum_{k}{C(w_k)} \sum_{l}{C(w_l)}} \\ &=& \dfrac{C_{dist}(w_i, w_j)}{C(w_i) C(w_j)} \Bigg/ \dfrac{\sum_{k}{\sum_{l}{C_{dist}(w_k, w_l)}}}{\sum_{k}{C(w_k)} \sum_{l}{C(w_l)}} \end{eqnarray*}
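To convince yourself that the re-arrangement is valid, here is a minimal numerical check in Python. The event words and all the numbers are made up for illustration (they do not come from the paper), and $C(w)$ is assumed to be a plain occurrence count. Exact fractions are used so that the two forms can be compared with `==`.

```python
from fractions import Fraction

# Hypothetical toy values, invented purely for illustration.
# c_dist[(a, b)]: distance-weighted co-occurrence frequency C_dist(a, b).
c_dist = {
    ("arrest", "charge"): Fraction(5, 2),
    ("arrest", "eat"): Fraction(1, 2),
    ("charge", "eat"): Fraction(1, 1),
}
# c[w]: occurrence count of w (the assumed meaning of C(w)).
c = {"arrest": 4, "charge": 3, "eat": 5}

total_dist = sum(c_dist.values())  # sum_k sum_l C_dist(w_k, w_l)
total_count = sum(c.values())      # sum_k C(w_k)


def pmi_as_defined(wi, wj):
    """P_dist(wi, wj) / (P(wi) * P(wj)), i.e. the first line of the derivation."""
    p_dist = c_dist[(wi, wj)] / total_dist
    p_i = Fraction(c[wi], total_count)
    p_j = Fraction(c[wj], total_count)
    return p_dist / (p_i * p_j)


def pmi_rearranged(wi, wj):
    """Pair-specific ratio divided by the corpus-wide ratio (second line)."""
    pair_term = c_dist[(wi, wj)] / (c[wi] * c[wj])
    average_term = total_dist / (total_count * total_count)
    return pair_term / average_term


# Both forms agree exactly for every pair in the toy data.
for pair in c_dist:
    assert pmi_as_defined(*pair) == pmi_rearranged(*pair)
```

The identity does not depend on the particular numbers: the re-arrangement only swaps the two inner denominators, so any positive values for `c_dist` and `c` would pass the same check.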

The numerator,

$$\dfrac{C_{dist}(w_i, w_j)}{C(w_i) C(w_j)}$$

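is the distance-weighted co-occurrence frequency of $w_i$ and $w_j$ divided by the product of their individual counts. In other words, it is roughly the average "distance" weight, as measured by the $C_{dist}$ function, per possible pairing of an occurrence of $w_i$ with an occurrence of $w_j$. It makes sense to normalize like this because $C_{dist}(w_i, w_j)$ itself is a "frequency" rather than a distance: it increases simply by having more occurrences of words $w_i$ and $w_j$.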

The denominator,

$$\dfrac{\sum_{k}{\sum_{l}{C_{dist}(w_k, w_l)}}}{\sum_{k}{C(w_k)} \sum_{l}{C(w_l)}}$$

is somewhat similar. It is the same "distance" per occurrence ratio, but computed over all word pairs $(w_k, w_l)$ in the documents rather than for one specific pair. Note that $\sum_{k}{C(w_k)} = \sum_{l}{C(w_l)}$ is the total number of words in the documents.

So the PMI itself, being the quotient of these two values, is a measure of the "distance" of words $w_i$ and $w_j$ compared to the overall average "distance".
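To make this concrete, here is a rough end-to-end sketch in Python on a toy corpus. Several things in it are assumptions of mine and not taken from the paper: the `weight` function (a simple $1/g$ stand-in, since the paper's actual weighting is in the equation image), the convention that $g$ is the sentence-index difference plus one, the choice to count each unordered pair of event occurrences once per document, and the toy documents themselves.

```python
from collections import Counter
from itertools import combinations


def weight(g):
    # ASSUMPTION: stand-in for the paper's distance weighting, which is not
    # reproduced here. Closer events (small g) get a larger weight.
    return 1.0 / g


def pmi_table(documents):
    """Compute P_dist(wi, wj) / (P(wi) * P(wj)) for every co-occurring pair.

    `documents` is a list of documents, each a list of sentences,
    each sentence a list of event words.
    """
    c = Counter()       # C(w): occurrence counts
    c_dist = Counter()  # C_dist(wi, wj): distance-weighted pair frequencies
    for doc in documents:
        tokens = [(word, s) for s, sentence in enumerate(doc) for word in sentence]
        c.update(word for word, _ in tokens)
        for (wa, sa), (wb, sb) in combinations(tokens, 2):
            if wa == wb:
                continue
            g = abs(sa - sb) + 1  # 1 = same sentence, 2 = neighboring, ...
            c_dist[tuple(sorted((wa, wb)))] += weight(g)

    total_dist = sum(c_dist.values())
    total_count = sum(c.values())
    table = {}
    for (wi, wj), value in c_dist.items():
        p_dist = value / total_dist
        table[(wi, wj)] = p_dist / ((c[wi] / total_count) * (c[wj] / total_count))
    return table


# Hypothetical toy corpus: two documents, events grouped by sentence.
docs = [
    [["arrest", "charge"], ["convict"]],
    [["arrest"], ["charge"], ["eat"]],
]
table = pmi_table(docs)
for pair, score in sorted(table.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy corpus "arrest" and "charge" co-occur in both documents and at close range, so their score comes out higher than that of "arrest" and "eat", which co-occur once and two sentences apart. That is the sense in which the quotient compares a pair's "distance" to the overall average.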