Statistical analysis of neural net results


I have a language-recognition neural net that decides whether a word or phrase is written in English, Spanish, or German by going through the phrase word by word and returning, for each word, a confidence between zero and one that the word belongs to each language.

Example:

=================== input: math is cool ===================

[math is cool] -> [english: 90%, german: 7%, spanish: 6%]

So for the phrase "Math is cool", it estimates the likelihood that each word is in each language, averages those percentages together, and uses those averages as its final guess.
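In code, that averaging step looks roughly like this (the per-word scores below are made-up illustrations, not real output from my net):

```python
# Per-word confidences from the (hypothetical) neural net for "math is cool".
# Each word maps to a confidence in [0, 1] for each language.
word_scores = {
    "math": {"english": 0.92, "german": 0.05, "spanish": 0.03},
    "is":   {"english": 0.85, "german": 0.10, "spanish": 0.05},
    "cool": {"english": 0.93, "german": 0.06, "spanish": 0.01},
}

def average_scores(word_scores):
    """Average the per-word confidences for each language."""
    languages = next(iter(word_scores.values())).keys()
    return {
        lang: sum(scores[lang] for scores in word_scores.values()) / len(word_scores)
        for lang in languages
    }

print(average_scores(word_scores))
# english averages to 0.90 here, so it becomes the final guess
```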

My issue is that words that are super common in one language, but still used somewhat often in one of the other languages, are skewing my results. For example, the word "die" in German means "the", but "die" is also a pretty common English word. Because the training data for the neural net is just copied and pasted from Wikipedia pages, the word "die" appears super often in the German training data, and the neural net has learned "die" as a strictly German word: [die] -> [english: 2%, german: 97%, spanish: 3%].

I need a way to analyze the data to negate the outlier "die" in the sentence "I don't want to die", without negating the high English results from the rest of the words.
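To make the skew concrete, here is a sketch with made-up per-word English scores for "I don't want to die": every word but "die" scores high for English, yet the single outlier pulls the arithmetic mean down noticeably. The median comparison at the end is just one candidate statistic that ignores a lone extreme value, not a tested fix:

```python
# Hypothetical per-word English confidences for "I don't want to die".
# "die" gets the skewed score described above; the rest score high for English.
english_scores = {
    "I": 0.95, "don't": 0.90, "want": 0.92, "to": 0.88, "die": 0.02,
}

scores = list(english_scores.values())

# Arithmetic mean: the single outlier drags the average well below
# what the other four words would suggest.
mean = sum(scores) / len(scores)

# Median: unaffected by one extreme value, so it stays near the
# typical word's score (sketch of one possible robust alternative).
median = sorted(scores)[len(scores) // 2]

print(mean)    # 0.734 -- noticeably lower than the non-outlier words
print(median)  # 0.90  -- close to the typical word's confidence
```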

P.S. I'm a rising sophomore in college working on a personal project, not a professional. I haven't taken any stat classes yet, so this is a little beyond my area of expertise. I've tried a couple of random guesses (like averaging the square roots and then squaring the result), but nothing that came even close to working. I apologize in advance if this is super easy or a dumb question.