I'm trying to find the best way to assign a "frequency score" to sentences in Chinese. Basically, I have a database which tells me how frequent each Chinese character is. From that, I would like to evaluate how "easy" a sentence is compared to other sentences, i.e. how easy it might be for a beginner to understand.
My first approach was to average all the character frequencies in the sentence. However, I found that certain sentences with very uncommon characters end up having a higher frequency score than sentences with more common characters. I think this is because it only takes a few very common characters to really increase the total score.
For example, here are two sentences of six characters (each number represents the frequency score of a character):
10 1 2 0 9 1 = 23
3 2 4 3 3 5 = 20
In this example, the second sentence is likely to be easier because all of its characters are reasonably frequent. The first sentence has a higher score because of the "10" and "9" characters. However, the "0" and "1" characters will make it hard for beginners to understand.
So I was wondering - what would be the best way to calculate the frequency score in this case?
Honestly, this has more to do with Chinese Language than Math.
Anyway, I would recommend rating the sentence by its rarest character, i.e. the character with the lowest frequency score (the highest rarity percentile). As a learner myself, I would base my opinion of a sentence squarely on its most difficult portion.
But this is just one opinion. You should probably experiment with different measures (range, mean, etc.).
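To make the comparison concrete, here is a minimal Python sketch that computes a few candidate measures for the two example sentences from the question. The function name `sentence_scores` is hypothetical, and the sketch assumes that a higher score means a more frequent character, so the hardest character is the one with the lowest score.

```python
from statistics import mean

def sentence_scores(freqs):
    """Return several candidate summaries of per-character frequency scores.

    Assumption: higher score = more frequent (easier) character.
    """
    return {
        "mean": mean(freqs),               # the original averaging approach
        "min": min(freqs),                 # score of the rarest character
        "range": max(freqs) - min(freqs),  # spread between easiest and hardest
    }

# The two example sentences from the question.
sentence_a = [10, 1, 2, 0, 9, 1]
sentence_b = [3, 2, 4, 3, 3, 5]

for name, freqs in (("A", sentence_a), ("B", sentence_b)):
    print(name, sentence_scores(freqs))
```

With the mean, sentence A ranks as easier than sentence B, which is the problem described in the question. With the minimum, sentence B ranks as easier because its rarest character is still reasonably common.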
Here are a few references which may help:
The database that you mention is probably a derivative of this or this.
In fact, the second link states:
Frequency-weighted average number of strokes:
The first link states:
You could probably use this for your decision?
Also, have you considered just deciding on the basis of HSK Levels?