I'm a non-mathematician.
I'm trying to determine algorithmically whether tweets are on topic with news articles. Part of this involves taking each word from every tweet, checking whether it appears in the news articles, and counting the matches. If the count is high, the tweets are probably on the same topic; if it's low, they probably aren't.
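To make the counting step concrete, here is a minimal sketch of what I mean. The function name `count_matches`, the lowercasing, and the simple regex tokenization are assumptions for illustration, not a fixed part of my setup.

```python
import re

# Hypothetical tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def count_matches(tweets, article_text):
    """Count tweet words (with repeats) that also appear in the article text."""
    article_words = set(tokenize(article_text))
    matches = 0
    for tweet in tweets:
        for word in tokenize(tweet):
            if word in article_words:
                matches += 1
    return matches

example_tweets = ["The election results are in", "I love pizza"]
example_article = "Election results were announced today"
example_count = count_matches(example_tweets, example_article)
```

In this toy example only "election" and "results" from the first tweet appear in the article, so the raw count is 2.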
However, the match count needs to be normalized: 10 tweets compared against 200 words of news articles is not directly comparable to 3 tweets compared against 400 words. Assume each tweet averages 10 words to make the comparison easier. How would one come up with a match score that reflects how much easier or harder it is to get matches, given the number of words on each side?
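For reference, the only naive idea I have is to divide the match count by the number of possible tweet-word/article-word pairs. I don't know whether this is statistically sound; it is just a sketch of the kind of number I'm after. The function name `naive_match_rate` and the match count of 12 are made up for illustration.

```python
def naive_match_rate(matches, n_tweets, article_word_count, words_per_tweet=10):
    """Matches divided by the number of possible (tweet word, article word) pairs.

    This is a naive guess at a normalization, not a validated measure.
    """
    tweet_words = n_tweets * words_per_tweet
    possible_pairs = tweet_words * article_word_count
    if possible_pairs == 0:
        raise ValueError("need at least one tweet word and one article word")
    return matches / possible_pairs

# Same hypothetical raw count (12 matches) in both scenarios from the question.
rate_a = naive_match_rate(12, n_tweets=10, article_word_count=200)  # 12 / 20000
rate_b = naive_match_rate(12, n_tweets=3, article_word_count=400)   # 12 / 12000
```

With the same raw count, the second scenario gets the higher rate under this scheme, because there were fewer possible pairs to produce those matches. Whether that is the right way to account for the difference is exactly what I'm unsure about.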