Chain rule and Uniform distribution


Given a sequence of words $w_{1:n}$ (short for $(w_1,w_2,\ldots,w_n)$), we use the chain rule to compute its probability as follows:

$$ P(w_{1:n}) = P(w_1) \prod_{i=2}^n P(w_i|w_{1:i-1}) $$

Suppose the words are confined to a vocabulary $V$, i.e., $w_i \in V$ for all $i$, and that we don't have any information about the distribution of the words. What can we say about the probability of an arbitrary sequence of length $n$?

I came up with two answers, but they don't agree. I suspect (Attempt A) is wrong, but I don't know what step is mistaken. I'd be grateful for any help with this.

Attempt A

We compute the "unigram" term $P(w)$ as follows:

$$ P(w) = \frac{1}{|V|} $$

Because we don't know anything about particular words, we assume they're uniformly distributed. In a similar fashion we compute the "bigram" term $P(w_i|w_{i-1})$ as follows:

$$ P(w_i|w_{i-1}) = \frac{1}{|V|^2} $$

Since there are $|V|^2$ bigrams, the probability is again the ratio of 1 over the entire space.

Carrying this out for all indices and plugging into the chain rule formula we get:

$$ \begin{align} P(w_{1:n}) &= \prod_{i=1}^n \frac{1}{|V|^i}\\ P(w_{1:n}) &= \frac{1}{|V|^1} \cdot \frac{1}{|V|^2} \cdot \cdots \cdot \frac{1}{|V|^n}\\ P(w_{1:n}) &= \frac{1}{|V|^{1 + 2 + \ldots + n}}\\ P(w_{1:n}) &= \frac{1}{|V|^{\frac{n(n+1)}{2}}}\\ \end{align} $$
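
A quick numeric sanity check already suggests something is wrong with this result: if every one of the $|V|^n$ sequences of length $n$ had probability $1/|V|^{n(n+1)/2}$, the total probability mass would be $|V|^n / |V|^{n(n+1)/2}$, which is less than 1 for $n \ge 2$. A minimal sketch, using a toy vocabulary size of 3 chosen purely for illustration:

```python
# Under Attempt A, each of the |V|**n sequences of length n would get
# probability 1 / |V|**(n*(n+1)//2). The total mass should be 1 but isn't.
V = 3  # toy vocabulary size, chosen arbitrarily for illustration

for n in range(1, 5):
    per_sequence = 1 / V ** (n * (n + 1) // 2)
    total = V ** n * per_sequence  # mass over all length-n sequences
    print(n, total)  # total equals 1 only for n = 1, then shrinks
```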

Attempt B

This second attempt is the one usually found in the literature (the context is that of language modeling). Again we have no information about the words and assume a uniform distribution. The key here is that when computing the probability of a word $w_i$ given its history $w_{1:i-1}$, we make the following conditional independence assumption:

$$ P(w_i|w_{1:i-1}) = P(w_i) $$

Using the chain rule we then describe the joint probability of the entire sequence by:

$$ P(w_{1:n}) = \prod_{i=1}^n P(w_i) $$

Using the uniformity assumption we get:

$$ \begin{align} P(w_{1:n}) &= \prod_{i=1}^n \frac{1}{|V|}\\ &= \frac{1}{|V|^n} \end{align} $$
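
Attempt B, by contrast, does yield a valid distribution: brute-force enumeration (again with a toy vocabulary, purely for illustration) confirms that assigning $1/|V|^n$ to each of the $|V|^n$ sequences sums to 1:

```python
from itertools import product

vocab = ["a", "b", "c"]  # toy vocabulary, |V| = 3
V = len(vocab)
n = 4

# Each of the |V|**n sequences gets probability 1/|V|**n under Attempt B.
total = sum(1 / V ** n for _ in product(vocab, repeat=n))
print(total)  # sums to 1 (up to floating-point error)
```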

Accepted answer

In principle the two descriptions you give should agree; I suspect it is your understanding of conditional probability that is causing the confusion.

In your Attempt A, whilst you consider the pairs $(w_{i-1}, w_i)$ as "bigrams", once you condition on the specific choice of $w_{i-1}$ the only randomness left in the system is the value of $w_i$, which (assuming independence) can take one of $|V|$ values, each with probability $|V|^{-1}$.

That is, assuming $w_i$ and $w_{i-1}$ are independent and uniformly distributed, then

$$\mathbf P(w_i\,|\,w_{i-1}) = \frac{1}{|V|}.$$

The important heuristic here is: once you have conditioned on the value of $w_{i-1}$ it is no longer random; so the extra $|V|^{-1}$ factor does not come into play.
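
This heuristic can be checked by brute force from the definition $P(w_i \mid w_{i-1}) = P(w_{i-1}, w_i)/P(w_{i-1})$. A sketch with a hypothetical three-word vocabulary: build the uniform joint over all $|V|^2$ bigrams, marginalize out the first word, and confirm every conditional comes out to $1/|V|$, not $1/|V|^2$:

```python
from itertools import product

vocab = ["a", "b", "c"]  # toy vocabulary, |V| = 3
V = len(vocab)

# Uniform joint distribution over all |V|**2 ordered pairs (bigrams).
joint = {pair: 1 / V ** 2 for pair in product(vocab, repeat=2)}

# Marginal of the first word: sum the joint over the second word.
marginal = {w1: sum(joint[(w1, w2)] for w2 in vocab) for w1 in vocab}

# Conditional P(w2 | w1) = joint / marginal: equals 1/|V| for every pair.
cond = {(w1, w2): joint[(w1, w2)] / marginal[w1] for (w1, w2) in joint}
print(set(round(p, 12) for p in cond.values()))  # a single value: 1/|V|
```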

Another answer

In your Attempt A, the "bigram" you discuss is not $\mathsf P(w_i\mid w_{1:i-1})$ but rather $\mathsf P(w_i\cap w_{1:i-1})$, the joint probability of the $i$th word choice and the preceding sequence of $i-1$ word choices, which is of course $\mathsf P(w_{1:i})$.

What $\mathsf P(w_i\mid w_{1:i-1})$ actually is, is the probability of the $i$th word choice given a preceding sequence of $i-1$ word choices. By definition: $$\mathsf P(w_i\mid w_{1:i-1})=\dfrac{\mathsf P(w_i\cap w_{1:i-1})}{\mathsf P(w_{1:i-1})}=\dfrac{\mathsf P(w_{1:i})}{\mathsf P(w_{1:i-1})}$$

And so in declaring that $\mathsf P(w_{1:i})=\mathsf P(w_i\cap w_{1:i-1})=1/\lvert V\rvert^i$, you are asserting that the $i$th word choice is independent of the preceding $i-1$ word choices: $$\mathsf P(w_i\mid w_{1:i-1})=\mathsf P(w_i)=1/\lvert V\rvert$$
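
One short worked step makes this concrete (assuming the uniform values above, so that $\mathsf P(w_{1:i})=1/\lvert V\rvert^i$): carrying out the division in the definition gives

$$\mathsf P(w_i\mid w_{1:i-1})=\frac{\mathsf P(w_{1:i})}{\mathsf P(w_{1:i-1})}=\frac{1/\lvert V\rvert^{i}}{1/\lvert V\rvert^{i-1}}=\frac{1}{\lvert V\rvert}$$

which is exactly the factor Attempt B contributes for each word, reconciling the two viewpoints.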