I am not sure this is the right venue for this question: if you think it is not, I will move it to another StackExchange board.
Explanation
I am trying to calculate phonotactic probability on a corpus of consonant-vowel-consonant (CVC) words.
Phonotactic probability can be defined as follows:
Phonotactic probability refers to the frequency with which a phonological segment, such as /s/, and a sequence of phonological segments, such as /s^/, occur in a given position in a word (Kansas University)
It can be calculated over segment sequences of various lengths, such as unigrams, bigrams, and trigrams. In simple terms:
For a word like blick in English, the unigram average would include the probability of /b/ occurring in the first position of a word, the probability of /l/ in the second position, the probability of /ɪ/ occurring in the third position, and the probability of /k/ occurring in the fourth position of a word. Each positional probability is calculated by summing the log token frequency of words containing that segment in that position divided by the sum of the log token frequency of all words that have that position in their transcription. The bigram average is calculated in an equivalent way, except that sequences of two segments and their positions are used instead of single segments. So for blick that would be /bl/, /lɪ/, /ɪk/ as the included positional probabilities.
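My reading of this definition can be sketched in Python as follows. This is only a minimal sketch, not the CorpusTools implementation: the function names, the toy corpus, and its token frequencies are hypothetical, and I am assuming base-10 logs and that the denominator for a position sums over every word long enough to contain an n-gram starting there.

```python
import math
from collections import defaultdict


def positional_probs(corpus, n=1):
    """Positional n-gram probabilities from log token frequencies.

    corpus: dict mapping a tuple of segments to its token frequency
            (hypothetical format, chosen for this sketch).
    """
    num = defaultdict(float)  # (position, n-gram) -> summed log frequency
    den = defaultdict(float)  # position -> summed log frequency of all words reaching it
    for segs, freq in corpus.items():
        logf = math.log10(freq)  # assumption: base-10 log
        for i in range(len(segs) - n + 1):
            num[(i, segs[i:i + n])] += logf
            den[i] += logf
    # Guard against a zero denominator (every word at that position has frequency 1)
    return {k: (v / den[k[0]] if den[k[0]] else 0.0) for k, v in num.items()}


def phonotactic_probability(word, corpus, n=1):
    """Average positional probability of the n-grams in `word`."""
    probs = positional_probs(corpus, n)
    grams = [(i, tuple(word[i:i + n])) for i in range(len(word) - n + 1)]
    return sum(probs.get(g, 0.0) for g in grams) / len(grams)


# Hypothetical toy corpus, NOT the CorpusTools example corpus
toy_corpus = {
    ("b", "l", "ɪ", "k"): 10,
    ("b", "ɪ", "t"): 100,
    ("l", "ɪ", "k"): 1000,
    ("k", "ɪ", "t"): 10,
}

blick = ("b", "l", "ɪ", "k")
unigram_pp = phonotactic_probability(blick, toy_corpus, n=1)
bigram_pp = phonotactic_probability(blick, toy_corpus, n=2)
print(unigram_pp, bigram_pp)
```

Words are stored as tuples of segments so that multi-character transcription symbols would still count as one segment.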
In mathematical terms, the phonotactic probability of the word blick would be calculated as follows:
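Written out from the definition quoted above, and using notation of my own, the calculation would look roughly like this. Here $W$ is the set of words in the corpus, $f(w)$ is the token frequency of word $w$, $w_i$ is the segment (or, for bigrams, the two-segment sequence) starting at position $i$ of $w$, and $|w|$ is the length of $w$; the base of the logarithm is left unspecified.

```latex
% Positional probability of segment (or sequence) s at position i
P(s, i) = \frac{\sum_{w \in W \,:\, w_i = s} \log f(w)}
               {\sum_{w \in W \,:\, |w| \ge i} \log f(w)}

% Unigram average for "blick"
PP_{\text{unigram}}(\text{blick}) =
  \frac{1}{4}\Big[ P(\text{b}, 1) + P(\text{l}, 2) + P(\text{ɪ}, 3) + P(\text{k}, 4) \Big]

% Bigram average for "blick"
PP_{\text{bigram}}(\text{blick}) =
  \frac{1}{3}\Big[ P(\text{bl}, 1) + P(\text{lɪ}, 2) + P(\text{ɪk}, 3) \Big]
```

For bigrams, I take the denominator to range over words long enough to contain a two-segment sequence starting at position $i$, which is my interpretation of "all words that have that position".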
Given an example corpus (e.g., the one from CorpusTools documentation):
The phonotactic probability of blick would be calculated as follows:
Question
Unfortunately, in my corpus, some bigrams (e.g., /Aw/, with token frequencies 'Aw': [1, 1, 1, 1]) appear only in words with a token frequency of one. As such, when calculating the sum of the log frequencies of words starting with /Aw/, the sum amounts to
log(1) + log(1) + log(1) + log(1)
which in turn equals 0. This is not very representative, since the bigram appears in at least four words. Is there a way to deal with this from a mathematical perspective?
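To make the problem concrete, here is a minimal sketch of the sum in question. The /Aw/ frequencies are the ones from my corpus; the second bigram and its frequency are hypothetical and only there for contrast. I use the natural log, but the result is the same in any base, since log(1) is always 0.

```python
import math

# Token frequencies of the four words containing /Aw/ (from my corpus)
aw_freqs = [1, 1, 1, 1]

# Hypothetical bigram found in a single word with token frequency 100
other_freqs = [100]

aw_log_sum = sum(math.log(f) for f in aw_freqs)
other_log_sum = sum(math.log(f) for f in other_freqs)

# The bigram attested in four words gets zero weight,
# while the one attested in a single word gets a positive weight.
print(aw_log_sum, other_log_sum)
```

So a bigram attested in four different words contributes nothing to the numerator, while one attested in a single, more frequent word does.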


