How can I create a word uniqueness score from normalized text?


I currently have song lyrics in the form of normalized tokens stored in a database.

I have computed both the number of unique words and the total word count per song. I am currently (and crudely) calculating the lyrical uniqueness of a song with the following formula:

unique words / total words = uniqueness score, where 1.0 means every word is unique and a score approaching 0.0 means nearly all words are the same

The problem is that as songs get longer, they are more likely to repeat words, so longer songs get a poor uniqueness score despite having a large and interesting vocabulary. On the other end, very short songs don't repeat words often, and can be rated as more unique than they should be.

I am trying to write a program that takes into account how long the song is when calculating the uniqueness. I do understand how to correlate the total word count with the unique word count, but I don't know what to do after that. How do I factor my correlation coefficient into my 0.0-1.0 rating system? Ideally, I'd like to take the unique word count and total word count as parameters, apply my formula, and then assign a handicap based on the total number of words.

I don't have a formal academic math background, so it's hard to even know how to ask the right question. Sorry if I am being too obtuse! (pun intended)


There is 1 best solution below


You might consider something like $$ \text{lyrical uniqueness} = \left( \frac{\text{unique words}}{\text{total words}}\right)^{1/\text{length}} . $$ Then longer songs with the same unique fraction will have lyrical uniqueness scores nearer $1$.
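As a rough illustration, the formula above could be sketched in Python as follows. This assumes that "length" means the total word count of the song, which the formula itself does not spell out; the function name `lyrical_uniqueness` and its argument names are made up for this example.

```python
def lyrical_uniqueness(unique_words: int, total_words: int) -> float:
    """Sketch of (unique / total) ** (1 / length).

    Assumption: 'length' is the total word count of the song.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 1 <= unique_words <= total_words:
        raise ValueError("unique_words must be between 1 and total_words")

    unique_fraction = unique_words / total_words
    # The base is at most 1, so a smaller exponent (a longer song)
    # pulls the result up toward 1.
    return unique_fraction ** (1.0 / total_words)


if __name__ == "__main__":
    # Two songs with the same unique fraction (0.5) but different lengths.
    print(lyrical_uniqueness(5, 10))
    print(lyrical_uniqueness(50, 100))
```

With the same unique fraction of 0.5, the 10-word song scores about 0.93 and the 100-word song about 0.99, so the longer song lands nearer $1$. For songs of realistic length the raw formula pushes almost every score very close to $1$, so it probably needs tuning before the scores are easy to tell apart.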

This function is modifiable. You can try multiplying the unique fraction by a constant between $0$ and $1$ (so even a song with all unique words won't have a lyrical score of $1$). You can multiply length by a factor, or raise it to a power, to adjust the rate at which longer songs increase the uniqueness quotient to get the lyrical uniqueness. Play until you get a measure that matches your subjective judgment.
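The adjustments described above might look like this as a minimal Python sketch. The parameter names `ceiling`, `length_scale`, and `length_power` are hypothetical knobs invented for illustration, "length" is again assumed to be the total word count, and the default values are placeholders rather than recommendations.

```python
def tunable_uniqueness(
    unique_words: int,
    total_words: int,
    ceiling: float = 1.0,       # constant in (0, 1] multiplying the unique fraction
    length_scale: float = 1.0,  # factor multiplying the length
    length_power: float = 1.0,  # power the length is raised to
) -> float:
    """Sketch of (ceiling * unique / total) ** (1 / (length_scale * length ** length_power)).

    Assumption: 'length' is the total word count of the song.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 1 <= unique_words <= total_words:
        raise ValueError("unique_words must be between 1 and total_words")
    if not 0.0 < ceiling <= 1.0:
        raise ValueError("ceiling must be in (0, 1]")
    if length_scale <= 0.0:
        raise ValueError("length_scale must be positive")

    base = ceiling * unique_words / total_words
    effective_length = length_scale * total_words ** length_power
    # A smaller effective length gives a larger exponent, which keeps
    # the score further from 1 and weakens the bonus for long songs.
    return base ** (1.0 / effective_length)


if __name__ == "__main__":
    # Same song, different settings, to compare against subjective judgment.
    for scale in (1.0, 0.1, 0.01):
        print(scale, tunable_uniqueness(50, 100, ceiling=0.9, length_scale=scale))
```

Lowering `length_scale` or `length_power` slows the rate at which longer songs drift toward $1$, while a `ceiling` below $1$ keeps even an all-unique song from scoring exactly $1$. Trying a few settings on songs you already have an opinion about is one way to "play" with it.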