Given Zipf's law, or Zipfian-distributed data, how do I find "high-frequency" and "low-frequency" cutoffs?


For example, I have a corpus data set with frequency counts of transitions in which one syllable or token predicts the next, and this frequency distribution follows a Zipfian distribution:

Token 1  predicts  Token n   Freq
Token a   ->      Token b   24
Token a   ->      Token c   20
Token a   ->      Token d   19
Token a   ->      Token e   7
Token a   ->      Token f   5
Token a   ->      Token g   4
Token a   ->      Token h   3
Token a   ->      Token i   2
Token a   ->      Token j   2
Token a   ->      Token k   1
Token a   ->      Token l   1
Token a   ->      Token m   1
Token a   ->      Token n   1
Token a   ->      Token o   1
.     .      .       .      .
.     .      .       .      .
.     .      .       .      .

In the frequency count list above, the distribution skews from a few high-frequency tokens down to a long tail of low-frequency tokens, most of which have a count of 1 in this case. How do we estimate the high- and low-frequency cutoff ranges?
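Before choosing cutoffs, one way to check that the counts really follow a roughly Zipfian shape is to fit a line to log frequency versus log rank; a Zipfian distribution gives an approximately straight line with negative slope. A minimal sketch using NumPy on the example counts above (the frequencies are taken from the table; the fitting approach itself is just one common heuristic, not a rigorous goodness-of-fit test):

```python
import numpy as np

# Frequencies of Token a -> Token b, c, d, ... from the example table
freqs = np.array([24, 20, 19, 7, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1], dtype=float)

# Rank 1 = most frequent transition
ranks = np.arange(1, len(freqs) + 1)

# Fit log(freq) = slope * log(rank) + intercept;
# for Zipfian data the slope is negative (roughly -1 in the classic case)
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
```

If the fit is poor (e.g. the points curve away from the line on a log-log plot), quartile-style cutoffs computed on the raw counts may behave unexpectedly, since the tail dominates the sample.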

I wonder if the interquartile range is applicable here: for example, anything below Q1 is a low-frequency token, and anything above Q3 is a high-frequency token?

For my project, I need to choose tokens that are high- and low-frequency. How do we set the high and low cutoffs for a Zipfian distribution?