Establishing fairness in test grading

Consider a group of 75 students who sit an exam consisting of 20 open questions, and are then randomly divided into 3 groups of 25 students {A, B, C} for grading by 3 different persons.

Let us suppose that the scores of the whole population are normally distributed, and that we expect roughly the same distribution in each individual group of N = 25.

Given that the grades in one group are considerably skewed in comparison to the other two, I intuitively expect that the evaluator is biased and the grading is unfair, but how can I express this mathematically?

How can I establish the probability and magnitude of such a bias given these conditions?

On BEST ANSWER

Even though you are assuming that unbiased student scores should be normally distributed, you are actually trying to make inferences about a very different object: a sample of 75 actual scores. Of course, these scores will be related to the underlying normal distribution, but that underlying normal distribution only holds exactly when the class size is infinite. With only 75 students, you actually have two levels of uncertainty:

  1. A "discreteness" effect: no finite sample will have an exact normal distribution. With 75 students, the actual (and unknown!) distribution of unbiased scores on the test may be quite non-normal, even bi-modal! That means that even if you knew the unbiased scores and plotted them, you would not see a normal distribution. To see this, generate 75 iid normal random variables (e.g., in Excel), and look at their empirical cumulative distribution. Compare this to the actual/underlying normal distribution. There will be a lot of sample-to-sample variation about the Normal CDF.
  2. A possible "grader" effect, whose magnitude you are trying to determine.
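The "discreteness" effect in point 1 can be checked with a short simulation. The following is a minimal sketch using only the Python standard library; the mean of 70 and standard deviation of 10 are illustrative assumptions, not values from the question. It draws many classes of 75 iid normal scores and records the largest vertical gap between each class's empirical CDF and the underlying Normal CDF.

```python
import random
from statistics import NormalDist

def max_ecdf_gap(sample, dist):
    """Largest vertical gap between the sample's empirical CDF and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap

rng = random.Random(0)                      # fixed seed so the run is repeatable
population = NormalDist(mu=70, sigma=10)    # assumed values, for illustration only

gaps = []
for _ in range(1000):                       # 1000 simulated classes of 75 students
    scores = [rng.gauss(population.mean, population.stdev) for _ in range(75)]
    gaps.append(max_ecdf_gap(scores, population))

print(f"typical max gap: {sorted(gaps)[len(gaps) // 2]:.3f}")
print(f"largest max gap: {max(gaps):.3f}")
```

A gap of 0.1 means that, somewhere along the score axis, the class's empirical CDF is 10 percentage points away from what the Normal CDF predicts, even though no grader bias was simulated.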

What you are trying to determine is not the (discrete) distribution of unbiased scores (they are what they are, and we don't know what they are if bias exists), but the degree to which "grader effects" distort this distribution. For example, if you have an average class, then their unbiased scores will be a sample of size 75 from a normal distribution.

The subtlety here is that you are not comparing the graders' scores to an iid sample of size 25 from a normal distribution, but to a partition of the 75 unbiased actual scores into three groups of 25. This induces a negative covariance among the partitions, since the statistical properties of each partition are related: together they must give the overall distribution. Therefore, the underlying distribution for inference is not the normal distribution but the (unfortunately unknown) unbiased and discrete student score distribution, which is generated by an underlying normal distribution.
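The negative covariance can be seen in a small simulation. This is a sketch under assumed values: a single fixed pool of 75 scores drawn once from a normal with mean 70 and standard deviation 10 (both illustrative), which is then repeatedly split at random into three groups of 25.

```python
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
# One fixed class of 75 scores (assumed mean/sd, for illustration only).
pool = [rng.gauss(70, 10) for _ in range(75)]

means_a, means_b = [], []
for _ in range(5000):
    shuffled = pool[:]
    rng.shuffle(shuffled)                    # random partition of the same 75 scores
    means_a.append(sum(shuffled[:25]) / 25)  # group A average
    means_b.append(sum(shuffled[25:50]) / 25)  # group B average

corr_ab = correlation(means_a, means_b)
print(f"correlation between group A and group B averages: {corr_ab:.2f}")
```

Because the three group averages must always add up to three times the fixed overall average, a high average in one group forces lower averages elsewhere; with three equal groups the correlation between any two group averages should come out near $-1/2$.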

Fortunately, there is a conceptually simple way to test your concern: Bootstrap Permutation Tests. The permutation test is nonparametric, and so does not require normality. It is also computationally intensive, but conceptually very simple:

  1. Take all the grades and mix them together into a single large sample.
  2. Randomly assign (without replacement!!) a grade to the first grader.
  3. Do this for the second and third graders, in order.
  4. Repeat 2 and 3 until all grades are assigned.
  5. Calculate the average grade for each grader.
  6. Determine the ranking of each average grade (e.g., highest average = 1, lowest = 3).
  7. Save the (average grade, rank) pairs for this run.
  8. Repeat lots of times. E.g., 10,000 - 30,000 times. (on a computer, obviously :)
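The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive implementation: the function name `rank_distributions` and the example grades are hypothetical, and shuffling the pooled grades and cutting them into consecutive blocks is used as an equivalent shortcut for dealing grades to the graders one at a time.

```python
import random

def rank_distributions(grades_by_grader, n_runs=10_000, seed=0):
    """Reshuffle the pooled grades among the graders n_runs times.

    Returns one list per rank: entry 0 holds the simulated averages that
    ranked highest (rank 1) in each run, the last entry those that ranked lowest.
    """
    rng = random.Random(seed)
    sizes = [len(g) for g in grades_by_grader]
    pooled = [x for g in grades_by_grader for x in g]       # step 1: pool everything
    by_rank = [[] for _ in sizes]
    for _ in range(n_runs):                                  # step 8: repeat many times
        rng.shuffle(pooled)                                  # steps 2-4: deal without replacement
        means, start = [], 0
        for size in sizes:
            means.append(sum(pooled[start:start + size]) / size)  # step 5
            start += size
        for rank, m in enumerate(sorted(means, reverse=True)):    # step 6
            by_rank[rank].append(m)                          # step 7: save by rank
    return by_rank

# Hypothetical data: three graders with 25 made-up grades each.
data_rng = random.Random(42)
grades = [[data_rng.gauss(70, 10) for _ in range(25)] for _ in range(3)]

null = rank_distributions(grades, n_runs=2000)
for rank, values in enumerate(null, start=1):
    print(f"rank {rank}: mean of simulated averages = {sum(values) / len(values):.2f}")
```

Each of the three returned lists is the empirical distribution of one rank, which is what the observed grader averages get compared against.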

This is not a true permutation procedure, as there are roughly $5 \times 10^{19}$ possible ways to choose 25 scores out of 75 (and far more ways to split all 75 among three graders), which is too many to enumerate. Instead, you will be forming random permutations, which approximate the full set of permutations and should give you a very accurate approximation.

Once you've done the above, you will have created the empirical distribution of each RANK; that is, the distribution of the lowest average grader score, middle average grader score, and top average grader score. Then, you simply need to compare your actual average grader scores to their respective rank distributions.

For example, if your "skewed" grader was giving very low scores (so their average grade was lower than the other two graders'), then you would compare their average grade to the distribution of average grades with RANK=3. You could then do a one-sided test (lower tailed) by examining the number of permutation cases that have a RANK 3 average score at or below your observed RANK 3 average grader score. If that works out to $\leq 1$%, then I'd be somewhat suspicious.
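A lower-tailed comparison of this kind might look like the sketch below. Everything in it is hypothetical: the helper names, the evenly spaced made-up grades, and the 5-point gap between grader C and the other two are chosen only to make the example concrete.

```python
import random

def lowest_mean_null(pooled, group_size, n_groups, n_runs, rng):
    """Simulated distribution of the lowest (RANK 3) group average
    when the pooled grades are randomly re-dealt to the graders."""
    pooled = list(pooled)
    lows = []
    for _ in range(n_runs):
        rng.shuffle(pooled)
        means = [
            sum(pooled[i * group_size:(i + 1) * group_size]) / group_size
            for i in range(n_groups)
        ]
        lows.append(min(means))
    return lows

def lower_tail_p(observed, simulated):
    """Fraction of simulated values at or below the observed value."""
    return sum(s <= observed for s in simulated) / len(simulated)

# Made-up grades: graders A and B average 72, grader C averages 67.
a = [60 + i for i in range(25)]
b = [60 + i for i in range(25)]
c = [55 + i for i in range(25)]

observed_low = min(sum(g) / len(g) for g in (a, b, c))   # observed RANK 3 average
rng = random.Random(7)
null_low = lowest_mean_null(a + b + c, 25, 3, 5000, rng)
p = lower_tail_p(observed_low, null_low)
print(f"observed lowest average: {observed_low:.1f}, lower-tail fraction: {p:.4f}")
```

The printed fraction is the share of random re-dealings whose lowest average is at or below the observed one, which is the quantity to hold against a threshold such as 1%.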