For example, I have a corpus data set with frequency counts of transitions from one syllable or token to the next, and this frequency distribution is Zipfian:
Token1 -> Token2           Freq
Token a -> Token b 24
Token a -> Token c 20
Token a -> Token d 19
Token a -> Token e 7
Token a -> Token f 5
Token a -> Token g 4
Token a -> Token h 3
Token a -> Token i 2
Token a -> Token j 2
Token a -> Token k 1
Token a -> Token l 1
Token a -> Token m 1
Token a -> Token n 1
Token a -> Token o 1
...                        ...
In the frequency list above, the distribution skews from a few high-frequency tokens down to a long tail of low-frequency tokens, most of which have a count of 1. How do we estimate cutoff ranges for "high-frequency" and "low-frequency"?
I wonder if the interquartile range is applicable here, e.g., anything below Q1 is a low-frequency token and anything above Q3 is a high-frequency token.
For my project, I need to select both high- and low-frequency tokens. How do we set the high and low cutoffs for a Zipfian distribution?
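To make the IQR idea concrete, here is a minimal sketch on the counts from the table above (using numpy; the choice of `<=`/`>=` rather than strict inequalities is mine, because on these skewed counts Q1 collapses to the minimum and a strict "below Q1" would select nothing):

```python
import numpy as np

# Transition counts from the example table (Token a -> b, c, d, ...)
freqs = np.array([24, 20, 19, 7, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1])

# Quartile cutoffs (numpy's default linear interpolation)
q1, q3 = np.percentile(freqs, [25, 75])

low = freqs[freqs <= q1]    # candidate low-frequency tokens
high = freqs[freqs >= q3]   # candidate high-frequency tokens

print(q1, q3)           # 1.0 6.5
print(low.tolist())     # [1, 1, 1, 1, 1]
print(high.tolist())    # [24, 20, 19, 7]
```

Note that because of the Zipfian skew, Q1 here equals the minimum count (1), so the low-frequency group is just the long tail of hapax transitions. This is part of what I am unsure about: whether quartiles are a sensible basis for cutoffs on such a heavy-tailed distribution at all.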