Hidden Markov Models in part-of-speech tagging


I'm trying to prove the following statement; I'll describe its context first, though the context is not strictly necessary for solving the problem. In natural language processing, specifically in part-of-speech tagging, given a sequence of words $x_{1}...x_{n}$, one wishes to tag each word with its part of speech, say $t_{1}...t_{n}$, and the tag sequence $t_{1}...t_{n}$ should be chosen so as to maximize $P(t_{1}...t_{n}|x_{1}...x_{n})$.

As an example, given the sentence "John is good", the most likely tag sequence would be something like "NOUN VERB ADJECTIVE". For this particular sentence, most other tag sequences would have probability 0, since each word admits only a few possible parts of speech.

To find the tag sequence maximizing $P(t_{1}...t_{n}|x_{1}...x_{n})$, a few simplifying assumptions are made. First note that, by Bayes' theorem, finding the tag sequence that maximizes $P(t_{1}...t_{n}|x_{1}...x_{n})$ is equivalent to finding the tag sequence that maximizes $P(x_{1}...x_{n}|t_{1}...t_{n})\cdot P(t_{1}...t_{n})$, since the denominator $P(x_{1}...x_{n})$ is the same for every candidate tag sequence.
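Written out, the Bayes step is:

\begin{align} \arg\max_{t_1,\ldots,t_n} P(t_1\ldots t_n\mid x_1\ldots x_n) &= \arg\max_{t_1,\ldots,t_n} \frac{P(x_1\ldots x_n\mid t_1\ldots t_n)\,P(t_1\ldots t_n)}{P(x_1\ldots x_n)} \\ &= \arg\max_{t_1,\ldots,t_n} P(x_1\ldots x_n\mid t_1\ldots t_n)\,P(t_1\ldots t_n), \end{align}

where the denominator may be dropped because it does not depend on the tag sequence being optimized over.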

My question concerns the following argument made in the textbook (Speech and Language Processing, by Jurafsky and Martin): "The first assumption is that the probability of a word appearing is dependent only on its own part-of-speech tag; that it is independent of other words around it, and independent of the other tags around it."

Thus, $P(x_{1}...x_{n}|t_{1}...t_{n}) = \prod_{i=1}^{n} P(x_{i}|t_{i})$.

I am trying to see why this final equation follows from the simplifying assumption. It seems obvious, but how does it follow directly from the laws of probability?


Best answer:

The first step uses the Chain Rule:

\begin{align} P(x_1,\ldots,x_n\mid t_1,\ldots ,t_n) &= P(x_1\mid x_2,\ldots,x_n,t_1,\ldots ,t_n) \\ & \quad\times P(x_2\mid x_3,\ldots,x_n,t_1,\ldots ,t_n) \\ & \quad\times P(x_3\mid x_4,\ldots,x_n,t_1,\ldots ,t_n) \\ & \qquad\cdots \\ & \quad\times P(x_n\mid t_1,\ldots ,t_n) \\ & \\ &= P(x_1\mid t_1) \times P(x_2\mid t_2) \times \ldots \times P(x_n\mid t_n) \\ &\qquad\text{by the conditional independence assumption} \end{align}
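To make the two-step argument concrete, here is a minimal brute-force sketch in Python. All probabilities and the tiny emission/transition tables below are illustrative assumptions, not estimates from any corpus. It scores every candidate tag sequence by $P(x_1\ldots x_n\mid t_1\ldots t_n)\,P(t_1\ldots t_n)$, with the emission term factored as $\prod_i P(x_i\mid t_i)$ per the conditional independence assumption and the tag prior modeled as a bigram chain:

```python
from itertools import product

# Toy HMM for the sentence "John is good".  All probabilities below are
# illustrative assumptions, not estimates from any real corpus.
words = ["John", "is", "good"]
tags = ["NOUN", "VERB", "ADJ"]

# Emission probabilities P(word | tag).  Pairs not listed are taken as 0,
# which is why most tag sequences score 0 for this sentence.
emit = {
    ("John", "NOUN"): 0.9,
    ("is", "VERB"): 0.9,
    ("good", "ADJ"): 0.8,
    ("good", "NOUN"): 0.1,
}

# Bigram tag-transition probabilities P(t_i | t_{i-1}); "<s>" marks the start.
trans = {
    ("<s>", "NOUN"): 0.5, ("<s>", "VERB"): 0.2, ("<s>", "ADJ"): 0.3,
    ("NOUN", "VERB"): 0.5, ("VERB", "ADJ"): 0.4, ("VERB", "NOUN"): 0.3,
}

def score(tag_seq):
    """P(x_1..x_n | t_1..t_n) * P(t_1..t_n) under the HMM assumptions:
    emissions factor as a product of P(x_i | t_i), tags as a bigram chain."""
    p, prev = 1.0, "<s>"
    for word, tag in zip(words, tag_seq):
        p *= emit.get((word, tag), 0.0) * trans.get((prev, tag), 0.0)
        prev = tag
    return p

# Brute-force argmax over all |tags|^n candidate tag sequences.
best = max(product(tags, repeat=len(words)), key=score)
print(best)  # -> ('NOUN', 'VERB', 'ADJ')
```

In practice the Viterbi algorithm replaces this exhaustive search, but the brute-force version makes it easy to see that the quantity being maximized is exactly the factored product from the derivation above.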