Picking the most relevant words for each category of texts: a Bayesian approach?


Ten years after taking a probability theory course, I am now dealing with a practical situation. I think my problem is quite basic, but I got stuck, so I am writing this post.

Consider a vocabulary $V$ (a set of words) and a set of documents $D$. Each document belongs to exactly one of 10 categories $\{c_1, \cdots, c_{10}\}$: $i \ne j \implies c_i \cap c_j = \varnothing$, and $\cup_{i=1}^{10} c_i = D$. What I want to do is pick out the words that are most relevant to a given category.

My first idea is to write down the conditional probability $f(v, d) = P(d \in c_1 \mid v \in d)$, where $v$ is a word in $V$ and $d \in D$.

Here my intent is to estimate $P(v \in d)$ by counting: $P(v \in d) = (\text{number of times } v \text{ appears in } d)/|d|$, where $|d|$ is the number of word tokens in $d$.
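To make the counting concrete, here is a minimal sketch of that estimate (the tokenization by whitespace and the toy document are my own assumptions, not part of the original setup):

```python
def p_word_in_doc(v, doc_tokens):
    """Estimate P(v in d) as (count of v in d) / |d|,
    where doc_tokens is the list of word tokens of document d."""
    return doc_tokens.count(v) / len(doc_tokens)

doc = "the cat sat on the mat".split()  # |d| = 6 tokens
print(p_word_in_doc("the", doc))  # "the" occurs 2 times out of 6
```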

I think the conditional probability $f(v, d)$ represents the probability that document $d$ is a member of category $c_1$ given that $v \in d$. My strategy is to pick the words $v$ with the highest values of $f_1(v) = E_d[P(d \in c_1 \mid v \in d)]$.
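As a side note, a direct empirical version of this strategy could look like the sketch below. Here I estimate $P(d \in c_1 \mid v \in d)$ by document counts (the fraction of documents containing $v$ that lie in the target category) rather than by the token counts above; that choice, along with the toy corpus and labels, is my own assumption for illustration:

```python
from collections import Counter

def score_words(docs, labels, target):
    """For each word v, estimate P(d in c_target | v in d) as
    (# docs of the target category containing v) / (# docs containing v)."""
    containing = Counter()         # number of documents containing v
    containing_target = Counter()  # ... of which belong to the target category
    for tokens, label in zip(docs, labels):
        for v in set(tokens):      # count each word once per document
            containing[v] += 1
            if label == target:
                containing_target[v] += 1
    return {v: containing_target[v] / containing[v] for v in containing}

docs = [["goal", "match"], ["match", "vote"], ["vote", "senate"]]
labels = ["sport", "politics", "politics"]
scores = score_words(docs, labels, "politics")
# "vote" occurs only in politics documents, "match" in one of each category
```

Sorting `scores` and taking the top entries then gives the candidate words for the category.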

But what does this expectation mean? I tried to write it out:

\begin{align*}
f_1(v) & = \frac{1}{|D|} \sum_{d \in D} P(d \in c_1 \mid v \in d) = \frac{1}{|D|} \sum_{d \in D} \frac{P(d \in c_1 \text{ and } v \in d)}{P(v \in d)} \\
& = \frac{1}{|D|} \sum_{d \in D} \frac{P(v \in d \mid d \in c_1)\,P(d \in c_1)}{P(v \in d)} \\
& = \frac{1}{|D|} \sum_{d \in c_1} \frac{P(v \in d \mid d \in c_1)}{P(v \in d)} = \frac{1}{|D|} \sum_{d \in c_1} \frac{P(v \in d)}{P(v \in d)} = |c_1|/|D|,
\end{align*}
which is absurd.

Could somebody give me some advice? Thanks.