So this is probably an easy question, but I'm only so-so at math (I'll probably switch to a Pythonesque pseudo-code at some point).
I am developing an application (language learning), and I have 10 separate skill levels that content can be sorted into -
1, 2, 3, 4, 5, 6, 7, 8, 9, and 10
Any individual word belongs to exactly one of these 10 skill levels. 1 is beginner, 9 is expert, and 10 is for words that are not part of the skill system at all.
The problem I am encountering when averaging is that the heavier weights are GREATLY over-represented. I think it's because I am treating an ordinal scale as regular integers for calculation - level 10 is NOT 10x harder than level 1.
For example here's a set of numbers from a typical analysis:
[41, 4, 1, 0, 1, 0, 0, 0, 0, 12]
As you can see, the first category is VERY heavily represented - making up the majority of the content, but if I do a simple average, like so:
((41 * 1) + (4 * 2) + (1 * 3) + (1 * 5) + (12 * 10)) / 59
All of those 10s are going to weigh much more heavily than the initial 1s.
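Switching to that Pythonesque pseudo-code I mentioned (the function name is just mine for illustration), here's the naive calculation I'm doing now:

```python
# counts[i] is the number of words at skill level i+1,
# taken from the example analysis above.
counts = [41, 4, 1, 0, 1, 0, 0, 0, 0, 12]

def ordinal_average(counts):
    """Naive weighted mean that treats ordinal levels as plain integers."""
    total = sum(counts)
    weighted = sum(count * level for level, count in enumerate(counts, start=1))
    return weighted / total

print(ordinal_average(counts))  # 177 / 59 = 3.0
```

So the 12 level-10 words contribute 120 of the 177 total weight, even though they're only about a fifth of the words.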
I've considered special rules for the 10s, since they are likely to show up occasionally even in lower-skill content; but technically the same would be true of a 9, given a sufficiently small amount of content.
For example, think of content where the skill level is heavily bifurcated, as so:
[20, 0, 0, 0, 0, 0, 0, 0, 3, 0]
Despite the fact that level 1 words are represented 20 separate times, and there are only 3 words at level 9, you'd have:
((20 * 1) + (3 * 9)) / 23 = 2.04
So 13% of the content raises the average by an entire skill level.
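To make that sensitivity concrete in code (again just a sketch, same naive average as above):

```python
def ordinal_average(counts):
    """Naive weighted mean that treats ordinal levels as plain integers."""
    total = sum(counts)
    weighted = sum(count * level for level, count in enumerate(counts, start=1))
    return weighted / total

pure_beginner = [20, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 20 level-1 words
bifurcated    = [20, 0, 0, 0, 0, 0, 0, 0, 3, 0]  # plus just 3 level-9 words

print(ordinal_average(pure_beginner))  # 1.0
print(ordinal_average(bifurcated))     # 47 / 23, about 2.04
```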
Is there a way to do my calculation of the average in such a way that I am not essentially using ordinal weight this way?