I am grading 10 student projects with three other assessors. The top two students will win a prize, so it's important that we try to account for any grader who is overly kind or harsh in their scores.
Each project is graded (marks out of 5) by two assessors (drawn at random from the four assessors overall) and is scored on three criteria of equal importance (originality, method, presentation). The three grades from each assessor are averaged to produce an average score out of five, and the two assessors' average scores for each project are then averaged again into a final average mark out of five.
The results look like this, where the green/blue/orange/red are the four assessors.
You can see the total average score in the purple column.
But you can also see that according to "How_good_actually?" (n.b. see the edit note about this column at the bottom of this post) we have a problem. John should have been the best project, but because he was assigned the red grader, who is very harsh, he has dropped out of the top two. Chris, meanwhile, has made it into the top two for a prize despite being an average student, because he got the green grader, the kindest. What's more, Joe should have been in with a shout for the other top-two spot, but the red grader knocked him down too.
You can see from the next table that the red grader has been consistently harsh across the board and the green grader has been overly generous. The blue and orange graders were about right, giving roughly as many marks as they should have.
So, my question is: how can we apply some sort of normalisation to each assessor's average scores, so that relative to the overall population of assessors their scores are adjusted up or down according to how generous or harsh they have been?
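One simple approach (a sketch, not necessarily the best answer) is mean-centring: estimate each assessor's bias as the gap between their own mean score and the overall mean across all assessors, then subtract that bias from each of their scores. The numbers and assessor labels below are hypothetical, not taken from my table:

```python
# Illustrative sketch with made-up per-project average scores for each assessor.
# An assessor's "bias" is their mean score minus the overall mean; we subtract
# it so that a generous assessor's scores come down and a harsh one's go up.
scores = {
    "green":  [5.0, 4.0, 4.67],
    "red":    [3.33, 2.67, 2.33],
    "blue":   [3.67, 3.0, 3.33],
    "orange": [4.0, 2.33, 1.67],
}

all_scores = [s for vals in scores.values() for s in vals]
overall_mean = sum(all_scores) / len(all_scores)

adjusted = {}
for assessor, vals in scores.items():
    bias = sum(vals) / len(vals) - overall_mean  # positive = generous grader
    adjusted[assessor] = [round(s - bias, 2) for s in vals]
```

After the adjustment every assessor's scores have (approximately) the same mean, so rankings built from them are no longer driven by who happened to get the kind or the harsh grader. Note this only corrects the average level, not how spread out each assessor's marks are.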
The data for the main table is here:
ID Student_name How_good_actually? Average_score_total Assessor_1_average Assessor_1_criteria_1 Assessor_1_criteria_2 Assessor_1_criteria_3 Assessor_2_average Assessor_2_criteria_1 Assessor_2_criteria_2 Assessor_2_criteria_3
2 Jane 4 4.33 5 5 5 5 3.67 4 3 4
5 Chris 3 3.83 4 4 5 3 3.67 4 3 4
1 John 5 3.67 4 4 3 5 3.33 4 3 3
3 Joe 4 3.67 4.67 5 4 5 2.67 3 2 3
4 Sarah 3 2.83 3.33 4 3 3 2.33 2 2 3
7 Ray 3 2.83 4 4 4 4 3 3 2 4
6 Mary 3 2.83 2.67 2 3 3 1.67 2 1 2
8 Sandra 2 2.33 2.33 2 3 2 2.33 2 2 3
9 Paul 2 2.33 3 3 3 3 1.67 1 2 2
10 Rebecca 1 1.17 1.33 1 2 1 1 1 1 1
Finally, sorry if I've tagged this wrongly; I've tried to pick the most relevant tags for the question that I could. Not being a mathematician makes this hard, as I have no idea what the best answer will look like. I thought that perhaps z-scores would be appropriate, but couldn't find the tag.
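In case z-scores do turn out to be relevant: the idea is to standardise each assessor's scores to mean 0 and standard deviation 1, which corrects both their average level and how spread out their marks are. A minimal sketch, using hypothetical assessor labels and numbers rather than my real data:

```python
import statistics

# Hypothetical per-assessor score lists (labels are illustrative only).
scores = {
    "green": [5.0, 4.0, 4.67, 3.33],
    "red":   [3.33, 2.67, 2.33, 1.0],
}

def zscores(vals):
    """Standardise one assessor's scores to mean 0, standard deviation 1."""
    mu = statistics.mean(vals)
    sigma = statistics.pstdev(vals)  # population sd; sample sd works too
    return [(v - mu) / sigma for v in vals]

normalised = {assessor: zscores(vals) for assessor, vals in scores.items()}
```

A project's two z-scores (one per assessor) can then be averaged and students ranked on that, since the z-scores are on a common scale regardless of which assessors graded them. One caveat: with only a handful of projects per assessor, the estimated mean and standard deviation are themselves quite noisy.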
EDIT NOTE:
Regarding the comment from @Paul: I added the "How good actually" column only to illustrate that red was harsh and green overly generous. In reality there would of course be no such "how good actually" measure, so perhaps it was unhelpful to include it. Nevertheless, it remains the case that green awarded more marks overall than the average assessor and red awarded fewer. So my question is really about how to normalise these two assessors to fit the pack better, if that makes sense.
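One further thought on the restated question: since each project is seen by only two of the four assessors, a standard way (an assumption on my part, not something settled above) is to model each raw score as a student effect plus an assessor effect and fit by least squares; the fitted student effects are then assessor-corrected marks. A toy sketch with made-up numbers:

```python
import numpy as np

# Toy sketch: model score = student_effect + assessor_effect and solve by
# least squares. Rows are (student_index, assessor_index, score); the
# numbers are hypothetical, with assessor 1 uniformly 1.5 marks harsher.
obs = [(0, 0, 5.0), (0, 1, 3.5),
       (1, 0, 4.0), (1, 1, 2.5),
       (2, 0, 4.5), (2, 1, 3.0)]

n_students, n_assessors = 3, 2
X = np.zeros((len(obs), n_students + n_assessors))
y = np.zeros(len(obs))
for row, (s, a, score) in enumerate(obs):
    X[row, s] = 1.0                # indicator for the student
    X[row, n_students + a] = 1.0   # indicator for the assessor
    y[row] = score

# The design is rank-deficient (one shared constant between the two effect
# groups); lstsq returns the minimum-norm solution, and differences between
# student effects -- which is all the ranking needs -- are still identified.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
student_effects = coef[:n_students]
```

Ranking on `student_effects` then compares students as if they had all faced the same panel. This works provided the grading graph is connected, i.e. the assessor pairings overlap enough to link every project to every assessor through shared gradings.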

