theoretical and observed word probabilities disagreement

I am missing something obvious but cannot put my finger on it.

I want to compute the probabilities that words with particular features appear by chance.

Given a set of three characters {A, B, " "}, I want to compute the probability of each word that starts and ends with 'A'.

A word is defined as X randomly sampled letters with N spaces before and M spaces after (N and M being greater than or equal to 1), or bounded by the start or end of a sentence. A word must also start and end with the letter "A". Every other sequence of characters is discarded and not taken into account.

The characters do not all have the same probability of appearing:

P(x="A") = 0.3

P(x="B") = 0.6

P(x=" ") = 0.1

I generated 10000 sentences of length 160 (chosen arbitrarily). I ran the simulation twice and observed, as expected, slight differences between the observed probabilities.

From the two datasets I observed (run 1 / run 2):

P(w="A") = 0.27 / 0.25

P(w="AA") = 0.078 / 0.077

P(w="ABA") = 0.043 / 0.048

...

My problem is comparing them with the theoretical probabilities. If I apply the formula:

P(w=X) = 0.3^(number of "A" in X) * 0.6^(number of "B" in X)

The probabilities come out quite different (run 1 / run 2 / theoretical):

P(w="A") = 0.27 / 0.25 / 0.3

P(w="AA") = 0.078 / 0.077 / 0.3^2 = 0.09

P(w="ABA") = 0.043 / 0.048 / 0.3^2 * 0.6 = 0.054

Is it something that I should expect? Or, am I missing something obvious (my guess)?

Best answer

Analyzing the words from the beginning, we can take the initial A as given. We then have a probability $0.1$ of generating the admissible word $A$ if a space follows, and a probability $0.9\cdot\frac{0.3}{0.3+0.6}=0.3$ of generating an admissible word if a letter follows (with $\frac{0.3}{0.3+0.6}$ being the probability that the letter preceding the space that will eventually occur is an A). Thus the probability of forming an admissible word, starting with an initial A, is $0.4$, so you need to divide the unconditional probability of forming a word by $0.4$ to obtain its conditional probability among admissible words.
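As a cross-check, the same $0.4$ can be obtained by summing over the number $k$ of internal letters between the initial and final A. This sketch assumes the sentence is long enough that truncation at its end can be ignored (an assumption; the sentences in the question have length 160):

$$P(\text{admissible}\mid\text{initial A}) = \underbrace{0.1}_{\text{word A}} + \sum_{k=0}^{\infty} 0.9^{k}\cdot 0.3\cdot 0.1 = 0.1 + \frac{0.03}{1-0.9} = 0.1 + 0.3 = 0.4$$

Each term of the sum is the probability of $k$ arbitrary letters (each with probability $0.3+0.6=0.9$), then a final A, then a space.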

For instance, the word A has unconditional probability $0.1$, so its conditional probability is $\frac{0.1}{0.4}=\frac14$. Similarly, AA has unconditional probability $0.3\cdot0.1=0.03$, so its conditional probability is $\frac{0.03}{0.4}=0.075$, and ABA has unconditional probability $0.6\cdot0.3\cdot0.1=0.018$, so its conditional probability is $\frac{0.018}{0.4}=0.045$. All of these are in agreement with your observations.
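The calculation above can be written as a short function. This is only a minimal sketch: the names `P`, `P_ADMISSIBLE` and `conditional_probability` are made up for illustration, and the character probabilities are the ones given in the question.

```python
from math import prod

# Character probabilities taken from the question.
P = {"A": 0.3, "B": 0.6, " ": 0.1}

# Probability that a word with a given initial "A" is admissible:
# either a space follows immediately, or letters follow and the last one is "A".
P_ADMISSIBLE = P[" "] + (1 - P[" "]) * P["A"] / (P["A"] + P["B"])  # about 0.4


def conditional_probability(word: str) -> float:
    """Probability of `word` among words that start and end with 'A'."""
    if not word or word[0] != "A" or word[-1] != "A" or set(word) - {"A", "B"}:
        raise ValueError("word must consist of A/B and start and end with 'A'")
    # The initial "A" is taken as given; multiply the probabilities of the
    # remaining letters and of the space that closes the word.
    unconditional = prod(P[c] for c in word[1:]) * P[" "]
    return unconditional / P_ADMISSIBLE


for w in ("A", "AA", "ABA"):
    print(w, round(conditional_probability(w), 4))
```

Up to floating-point rounding, this should reproduce the three values worked out above.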

If you want to calculate further probabilities, you can simplify things a bit: every admissible word of two or more letters carries a factor $0.3\cdot0.1=0.03$ from the final A and the space that follows it. Combining this with the denominator gives a factor $\frac{0.3\cdot0.1}{0.4}=0.075$, which you multiply by the product of the probabilities of all the "internal" letters (the ones other than the initial and final A) to get the conditional probability of a word among the admissible words.
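To compare the $0.075$ shortcut with an experiment like the one in the question, a simulation along the following lines could be used. It is a sketch under assumptions, not the asker's actual code: the function names `simulate` and `shortcut`, the seed, and the choice to count words touching a sentence boundary are all illustrative.

```python
import random
from collections import Counter

CHARS = "AB "
WEIGHTS = [0.3, 0.6, 0.1]  # probabilities from the question
PROB = dict(zip(CHARS, WEIGHTS))


def simulate(n_sentences=10_000, length=160, seed=0):
    """Estimate relative frequencies of words that start and end with 'A'."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sentences):
        sentence = "".join(rng.choices(CHARS, weights=WEIGHTS, k=length))
        # split() with no argument splits on runs of spaces and drops empty
        # strings, so words at the sentence boundaries are counted as well.
        for word in sentence.split():
            if word[0] == "A" and word[-1] == "A":
                counts[word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def shortcut(word):
    """0.075 times the product of the internal letter probabilities.

    Only meant for admissible words of two or more letters.
    """
    p = 0.075
    for c in word[1:-1]:
        p *= PROB[c]
    return p


freqs = simulate()
for w in ("AA", "ABA", "AAA", "ABBA"):
    print(w, round(freqs.get(w, 0.0), 4), round(shortcut(w), 4))
```

The simulated frequencies fluctuate from run to run, as noted in the question, so only approximate agreement with the shortcut should be expected.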