Why can we use entropy to measure the quality of a language model?


I am reading "Foundations of Statistical Natural Language Processing". It has the following statement about the relationship between information entropy and language models:

...The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models...

But how about this example:

Suppose we have a machine that spits out two characters, A and B, one by one. The designer of the machine made A and B equally probable.

I am not the designer, and I try to model the machine through experiment.

During an initial experiment, I see the machine spit out the following character sequence:

A, B, A

So I model the machine as $P(A)=\frac{2}{3}$ and $P(B)=\frac{1}{3}$. And we can calculate the entropy of this model as: $$ -\frac{2}{3}\cdot\log{\frac{2}{3}}-\frac{1}{3}\cdot\log{\frac{1}{3}}= 0.918\quad\text{bit} $$ (the base is $2$)

But then the designer tells me about his design, so I refine my model with this new information. The new model looks like this:

$P(A)=\frac{1}{2}$, $P(B)=\frac{1}{2}$

And the entropy of this new model is: $$ -\frac{1}{2}\cdot\log{\frac{1}{2}}-\frac{1}{2}\cdot\log{\frac{1}{2}} = 1\quad\text{bit} $$ The second model is obviously better than the first one, but its entropy increased.
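As a quick sanity check, both entropies can be reproduced with a short Python sketch (the helper name `entropy` is just illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2, in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_fit = entropy([2/3, 1/3])   # model fit to the A, B, A sample
h_true = entropy([1/2, 1/2])  # the designer's uniform model

print(round(h_fit, 3))  # 0.918
print(h_true)           # 1.0
```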

My point is, due to the arbitrariness of the model being tried, we cannot blindly say a smaller entropy indicates a better model.

Could anyone shed some light on this?


3 Answers

BEST ANSWER

(For more info, please check here: https://stackoverflow.com/questions/22933412/why-can-we-use-entropy-to-measure-the-quality-of-language-model/22942119?noredirect=1#comment35045253_22942119)

After re-digesting the NLP book mentioned above, I think I can explain it now.

What I calculated is actually the entropy of the language model distribution. It cannot be used to evaluate the effectiveness of a language model.

To evaluate a language model, we should measure how much surprise it gives us on real sequences from that language. For each word actually encountered, the language model assigns a probability $p$, and we quantify the surprise as $-\log p$. We then average the total surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's, the average surprise given by the 2/3-1/3 model is:

$$ \frac{-500\log{\frac{2}{3}} - 500\log{\frac{1}{3}}}{1000} = \frac{1}{2}\log{\frac{9}{2}} \approx 1.085\quad\text{bit} $$

while the correct 1/2-1/2 model gives:

$$ \frac{-500\log{\frac{1}{2}} - 500\log{\frac{1}{2}}}{1000} = \frac{1}{2}\log{\frac{8}{2}} = 1\quad\text{bit} $$

So we can see that the 2/3-1/3 model gives more surprise, which indicates it is worse than the correct model.

Only when the sequence is long enough will the average surprise approximate the expectation under the true 1/2-1/2 distribution; a short sequence won't give a convincing result.
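The averaging described above can be sketched in a few lines of Python (the function name `avg_surprise` is my own; this is a sketch of the idea, not code from the book):

```python
import math

def avg_surprise(sequence, model):
    """Average surprise -log2(p) per symbol under the given model."""
    return sum(-math.log2(model[ch]) for ch in sequence) / len(sequence)

seq = "A" * 500 + "B" * 500        # a long sequence: 500 A's and 500 B's
fitted = {"A": 2/3, "B": 1/3}      # the model fit to the short sample
true_model = {"A": 1/2, "B": 1/2}  # the designer's model

print(round(avg_surprise(seq, fitted), 3))  # 1.085
print(avg_surprise(seq, true_model))        # 1.0
```

The worse model yields a higher average surprise even though its own distribution has lower entropy, which is exactly the point of the answer.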

I didn't mention cross-entropy here, since I think the jargon is intimidating and not very helpful for revealing the root cause.

ANSWER

In general, entropy does not tell you whether your model is good or bad. But for natural language, where there is a lot of structure and there are long-range dependencies that are very difficult to capture, we know that the entropy of natural language is lower than the entropy of a model that cannot capture those long-range dependencies.

That being said, it is also possible to have a stupid model with very low entropy that looks nothing like natural language. So entropy cannot be used as the sole statistic to quantify how good your model is, but it can provide guidance.

Your example is nothing like the natural language of humans; it's rather the natural language of a coin. :)

ANSWER

The first calculation is an estimate of the entropy, not the entropy itself. If you were to take infinitely many samples from your machine, the estimate would converge to 1 bit, which is the maximum entropy for a binary variable.

But imagine that our machine were to output only sequences of A's and B's belonging to the subset of the English language consisting of two-letter words. There are only two such words, 'AB' and 'BA', which means you could represent 'AB' with a 1 and 'BA' with a 0.

A string such as ABBABABAAB could then be represented as 10001, giving an average of 0.5 bits per symbol.

So we used our knowledge of the language to build a model with lower entropy than if we just looked at the A's and B's on their own.
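A toy Python sketch of that encoding scheme (the pair-to-bit codebook is the one described above; the variable names are mine):

```python
# Codebook from the answer: 'AB' -> '1', 'BA' -> '0'.
codebook = {"AB": "1", "BA": "0"}

s = "ABBABABAAB"
# Split the string into consecutive two-letter words and encode each one.
bits = "".join(codebook[s[i:i+2]] for i in range(0, len(s), 2))

print(bits)                # 10001
print(len(bits) / len(s))  # 0.5 (bits per symbol)
```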