Consider a group of 75 students who sit an exam consisting of 20 open questions, and are then randomly divided into 3 groups of 25 students {A, B, C} for grading by 3 different persons.
Let us suppose that the whole population is normally distributed and we expect roughly the same in each individual group of N = 25.
Given that the grades in one group are considerably skewed in comparison to the other two, I intuitively expect that the evaluator is biased and the grading is unfair, but how can I express this mathematically?
How can I establish the probability and magnitude of such a bias given these conditions?
Even though you are assuming that unbiased student scores should be normally distributed, you are actually trying to make inferences about a very different object: a sample of 75 actual scores. Of course, these scores will be related to the underlying normal distribution, but that distribution holds exactly only in the limit of an infinitely large class. With only 75 students, you actually have two levels of uncertainty:
What you are trying to determine is not the (discrete) distribution of unbiased scores (they are what they are, and we don't know what they are if bias exists), but the degree to which "grader effects" distort this distribution. For example, if you have an average class, then their unbiased scores will be a sample of size 75 from a normal distribution.
The subtlety here is that you are not comparing the graders' scores to an iid sample of size 25 from a normal distribution, but to a partition of the 75 unbiased actual scores into three groups of 25. This induces a negative covariance among the groups, since their statistical properties are tied together: combined, they must reproduce the overall distribution. Therefore, the reference distribution for inference is not the normal distribution but the unbiased (and discrete) distribution of student scores, which is unfortunately unknown and is itself generated by an underlying normal distribution.
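To make the negative covariance concrete, here is a short sketch using the standard result for sampling without replacement from a finite population. The symbols are my own notation, not part of the question: $\bar{x}_A$ and $\bar{x}_B$ are the mean scores of two of the groups, and $S^2$ is the variance of the 75 unbiased scores computed with divisor 74.

$$\operatorname{Var}(\bar{x}_A) = \frac{S^2}{25}\left(1 - \frac{25}{75}\right) = \frac{2S^2}{75}, \qquad \operatorname{Cov}(\bar{x}_A, \bar{x}_B) = -\frac{S^2}{75}$$

Under purely random assignment, any two group means therefore have correlation $-1/2$: if one group happens to draw weak students, the other two must on average look stronger. As a check, $3\operatorname{Var}(\bar{x}_A) + 6\operatorname{Cov}(\bar{x}_A, \bar{x}_B) = 0$, as it must be, because the three group means always sum to three times the fixed overall mean.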
Fortunately, there is a way to test your concern: bootstrap permutation tests. The permutation test is nonparametric, so it does not require normality. It is computationally intensive but conceptually very simple:
This is not an exhaustive permutation procedure, as there are about $5 \times 10^{19}$ possible ways to choose 25 scores out of 75. Instead, you will be drawing random permutations, which sample from the full set of permutations and should get you a very accurate approximation.
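As a quick check of the combinatorics, the counts can be computed exactly with Python's standard library. The variable names are mine; the second count, the number of ways to split all 75 scores among three labeled graders, is an extra figure added for context.

```python
from math import comb

# Ways to pick the 25 scores that go to a single grader.
ways_one_group = comb(75, 25)

# Ways to split all 75 scores into three labeled groups of 25:
# choose 25 for grader A, then 25 of the remaining 50 for grader B.
ways_three_groups = comb(75, 25) * comb(50, 25)

print(f"one group of 25:    {ways_one_group:.2e}")
print(f"three groups of 25: {ways_three_groups:.2e}")
```

Either count is far too large to enumerate, which is why random permutations are used.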
Once you've done the above, you will have created the empirical distribution of each RANK; that is, the distribution of the lowest average grader score, the middle average grader score, and the top average grader score. Then, you simply need to compare your actual average grader scores to their respective rank distributions.
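A minimal sketch of how these rank distributions could be simulated, assuming the procedure is: pool the 75 scores, shuffle them, split them into three groups of 25, and record the sorted group means. The function name `rank_mean_distributions`, its parameters, and the default of 10,000 permutations are my own choices for illustration, not something specified above.

```python
import random
from statistics import fmean


def rank_mean_distributions(scores, n_groups=3, n_perm=10_000, seed=0):
    """Simulate the sorted group means under random regrouping of the scores.

    Returns one list per rank: index 0 holds the top mean (RANK 1) from
    every permutation, the last index holds the lowest mean (RANK 3).
    """
    scores = list(scores)  # work on a copy so the caller's data is untouched
    size, remainder = divmod(len(scores), n_groups)
    if remainder:
        raise ValueError("scores must split evenly into n_groups")

    rng = random.Random(seed)
    by_rank = [[] for _ in range(n_groups)]
    for _ in range(n_perm):
        rng.shuffle(scores)  # one random assignment of students to graders
        means = sorted(
            (fmean(scores[g * size:(g + 1) * size]) for g in range(n_groups)),
            reverse=True,  # highest mean first
        )
        for rank, mean in enumerate(means):
            by_rank[rank].append(mean)
    return by_rank
```

Calling `rank_mean_distributions(all_75_scores)` with the pooled scores from all three graders returns three lists, and the last one is the reference distribution for the lowest-scoring grader.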
For example, if your "skewed" grader was giving very low scores (so their average grade was lower than the other two graders'), then you would compare their average grade to the distribution of average grades with RANK=3, the lowest of the three. You could then do a one-sided (lower-tailed) test by counting the permutation cases whose RANK 3 average score is at or below your observed RANK 3 average grader score. If that works out to $\leq 1\%$ of the cases, then I'd be somewhat suspicious.
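The lower-tailed comparison can be sketched as follows. This is an illustration under the same assumptions as the procedure described here (random regrouping of the pooled scores into equal-sized groups); the function name `lowest_mean_p_value` and its parameters are hypothetical.

```python
import random
from statistics import fmean


def lowest_mean_p_value(groups, n_perm=10_000, seed=0):
    """Fraction of random regroupings whose lowest group mean is at or
    below the lowest observed group mean (one-sided, lower tail).

    `groups` is a list of equal-sized lists of scores, one per grader.
    """
    size = len(groups[0])
    if any(len(g) != size for g in groups):
        raise ValueError("all groups must have the same size")

    observed = min(fmean(g) for g in groups)  # observed RANK 3 average
    pooled = [score for g in groups for score in g]

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        lowest = min(
            fmean(pooled[i:i + size]) for i in range(0, len(pooled), size)
        )
        if lowest <= observed:
            hits += 1
    return hits / n_perm
```

A returned value of 0.01 or less corresponds to the "somewhat suspicious" threshold. Note that this only measures how unusual the lowest average is under random assignment; it does not by itself estimate the magnitude of the bias.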