What is the most fair way to scale grades?


In many universities, professors scale or "curve" grades at the end to ensure (among other things) that there is no grade inflation. I'm interested in studying "fair" ways of doing this from a mathematical standpoint.

Let $S = \{X_1, X_2, \dots, X_k\}$ with $X_i \in [0,100]$ be the multiset of grades for a given class. A $\textit{scale}$ $S'$ of $S$ is some other multiset $S'=\{\phi(X_1), \phi(X_2), \dots, \phi(X_k)\}$, where $\phi:[0,100] \to [0,100]$ is some function. We say a scale is fair if $\phi$ is monotone increasing. Given two fair scales $S'$ and $S''$ with respective scale functions $\phi$ and $\psi$, we say $S'$ is fairer than $S''$ if $\sum_i |\phi(X_i) - X_i| \leq \sum_i |\psi(X_i) - X_i|$.

Let us suppose that the professor wants to scale the grades such that the mean grade is $70 \pm 5 \%$. Given the above definitions, which scale function $\phi$ should he choose to ensure the scale is as fair as possible? If there's not a simple function that always works, is there an algorithm or a strategy that might be helpful?

This is, of course, but one model. There's also issues of subjectivity associated with the word "fairness". Perhaps there's some notion of "fairness" that this model doesn't quite capture. If so, please mention it. My opinion is that the "fairest" way of scaling is ensuring that the scaling preserves the original order, and disturbs the original dataset as little as possible.

One other possible notion (which you may consider if you are interested, but not the one I've chosen to ask about) is the double sum $$\sum_{i,k} \left||\phi(X_i) - X_i| - |\phi(X_k) - X_k|\right|$$ which one would try to minimize among all possible (fair/monotone) scale functions $\phi$. With my original model above, a scale is "fair" if it doesn't disturb the original dataset much. With this model, a scale may disturb the original dataset a lot, but it might still be quite fair so long as students' grades are all altered by a similar amount (for instance, a flat scale of $20\%$).
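For concreteness, both objectives are easy to compute for a candidate $\phi$. Here is a minimal Python sketch (the function names and the example data are mine); the flat scale capped at $100$ illustrates how the cap alone introduces disparity under the second metric:

```python
# Hypothetical sketch: evaluate both proposed fairness metrics
# for a candidate scale function phi on a multiset of grades.

def total_disturbance(grades, phi):
    """First metric: sum_i |phi(X_i) - X_i|, how much the scale
    disturbs the original dataset."""
    return sum(abs(phi(x) - x) for x in grades)

def disturbance_disparity(grades, phi):
    """Second metric: sum_{i,k} ||phi(X_i)-X_i| - |phi(X_k)-X_k||,
    how unevenly the scale treats different students."""
    shifts = [abs(phi(x) - x) for x in grades]
    return sum(abs(a - b) for a in shifts for b in shifts)

grades = [42, 55, 61, 70, 88]
flat = lambda x: min(x + 20, 100)   # flat +20 scale, capped at 100

print(total_disturbance(grades, flat))    # 92 (the 88 only gains 12)
print(disturbance_disparity(grades, flat))  # 64, all from the capped student
```

An uncapped flat shift would drive the second metric to exactly $0$ while leaving the first metric large, which is the trade-off between the two notions.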

Feel free to discuss other mathematically rigorous notions of "fair" scaling which you believe are pertinent, or possibly cite relevant literature.


BEST ANSWER

This answer proposes another definition of a "fair scale" and "fair exam" than the one proposed in the question.

We can approach this question from an information theoretic standpoint. In this perspective, grades should be informative about some underlying quality of students to solve certain problems. As such, there is a "true distribution" of various skills and abilities students have. Unfortunately, most likely these qualities are multidimensional but we need to "compress" these into an ordinal scale. This entails making some strange judgments such as "making a typo in an equation is 0.3 times as bad as accidentally multiplying both sides by zero". But suppose we have obtained some acceptable scale expressed as integer scores from 0 to 100. I am suspicious of cardinal scales and therefore I attach only ordinal meaning to these numbers (for now).

Importantly, if we could observe the results for the entire student pool, there would be no need to ever rescale. The need to rescale arises because we observe different exams (information structures about the hypothetical score of students on "the true scale") for different parts of the student pool and want to make the scores comparable between students. In particular, if I believe I have set two similarly difficult exams for two random samples of 1000 students, but in one case all students receive 0 points and in the other all students receive 100 points, then I should revise my belief about whether I truly set similarly difficult exams. If the sample size for each group is only 10 students, I won't update my belief as much and will be more reluctant to rescale the exam.

Now let's say for simplicity that we have observed the scores of two exams for the entire pool (or for each exam a sufficiently large sample) of students. Let's suppose that a density estimate of the distribution of scores of each exam is given by $s_1:[0,100]\rightarrow \mathbb{R}$ and $s_2:[0,100]\rightarrow \mathbb{R}$ with $\int s_i(x)dx=1$. Now we are looking for transformations $t_1:[0,100]\rightarrow [0,100]$ and $t_2$ to make the exams comparable.

As posted in the question, there is a strong case for preserving the order of the scores, thus $t_1$ and $t_2$ should be strictly increasing functions. However, I do not see why there should be a strong case for maintaining point differences or optimizing an objective like the one given in the question. If we were unable to set an exam such that the distribution of scores equals our target distribution for a large student sample, then there is no good reason to attach any cardinal meaning to these scores. However, we want to make the scores a) comparable to each other and b) lose as little information in this process as possible. I therefore propose to impose that

a) Comparability holds: writing $F_{s_i}$ and $F_{p^*}$ for the cumulative distribution functions, $F_{p^*}(t_i(x)) = F_{s_i}(x)$ for all $x\in [0,100], i\in \{1,2\}$, i.e. $t_i = F_{p^*}^{-1}\circ F_{s_i}$, so that the transformed scores of each exam are distributed according to $p^*$, and

b) Minimal information loss holds: $$p^* = \arg \min_p \sum_i D_{KL}(p,s_i)$$ where $D_{KL}$ is the Kullback–Leibler divergence between the distributions.
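As an aside, the notation $D_{KL}(p,s_i)$ leaves the direction of the divergence open, and the minimizer depends on it. A short derivation sketch (my addition, using a Lagrange multiplier for the constraint $\int p = 1$):

```latex
% Case 1: minimize \sum_i D_{KL}(s_i \,\|\, p) over densities p.
% The p-dependent part is -\sum_i \int s_i \log p, and the first-order
% condition (s_1 + s_2)/p = \lambda gives
\[
  p^*(x) = \tfrac{1}{2}\bigl(s_1(x) + s_2(x)\bigr)
  \quad\text{(the mixture, i.e. the arithmetic mean).}
\]
% Case 2: minimize \sum_i D_{KL}(p \,\|\, s_i). The objective is
% 2\int p\log p - \int p \log(s_1 s_2), and the same argument gives
\[
  p^*(x) \propto \sqrt{s_1(x)\,s_2(x)}
  \quad\text{(the normalized geometric mean).}
\]
```

Either choice produces a defensible $p^*$; they differ in how much weight outlying exams receive.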

In practice, we of course observe the different exams year by year so one has to fix $p$ before the exams are taken. My simple rule of thumb for this is to try to get as close as possible to the maximum entropy distribution (uniform) of scores with both exams and adjusted scores. Ideally, I would only want to report percentiles to the department office. Unfortunately, rules such as "students below some cutoff fail and need to retake" prevent me from doing this and require choosing a different $p^*$ instead. (This means that to minimize the KL distance one has to adjust exams to match the target distribution rather than the other way around.) Also, it is hard to explain to students why one uses a crazy wiggle of a function to rescale scores, so $t_i$ tends to be "smoothed out" a bit.
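In the large-sample case, one standard way to realize an order-preserving map onto a target distribution is quantile mapping: send each score to the target quantile of its percentile rank. A small Python sketch for a uniform target on $[0,100]$ (the function name and the mid-rank tie convention are my own choices, not from any standard library):

```python
# Sketch of an order-preserving rescaling via percentile (quantile)
# mapping, assuming the target distribution p* is uniform on [0, 100].

def uniform_quantile_rescale(scores):
    """Map each observed score to 100 times its mid-rank percentile.
    Ties receive the same rescaled score, so order is preserved."""
    ranked = sorted(scores)
    n = len(scores)
    def percentile(x):
        below = sum(s < x for s in ranked)   # strictly lower scores
        equal = sum(s == x for s in ranked)  # ties, averaged via mid-rank
        return 100.0 * (below + (equal + 1) / 2 - 0.5) / n
    return [percentile(x) for x in scores]

exam = [10, 20, 20, 35, 90]                 # a "too hard" exam
print(uniform_quantile_rescale(exam))       # [10.0, 40.0, 40.0, 70.0, 90.0]
```

For a non-uniform target one would compose this percentile with the target's quantile function, which is exactly the $t_i = F_{p^*}^{-1}\circ F_{s_i}$ construction; the "crazy wiggle" mentioned below is what this map looks like for a lumpy empirical distribution.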

tl;dr: My own idea of "fair grading and scaling" is that the exam should be designed to have maximum entropy of scores from an ex ante perspective. Ex post, once I learn that an exam was too hard or too easy, I look for an order-preserving map which yields the targeted distribution if many students have taken the exam. Once there are only a few students in class... things become complicated.

ANSWER

I think the unfortunate truth is that the only fair scaling of grades is to not scale them at all.

Outside of the mathematical framework you want to consider, curving or scaling grades can only penalize those students who work hard and would have otherwise received high grades. Particularly in the case of a flat curve (where everyone gets $+x\%$), I find it to be the definition of unfairness that someone could receive an A when they only did enough correct work to earn a B, or heaven forbid a C.

But the unfairness isn't confined to the student body: if there are scholarships tied to GPAs on the line, organizations might end up misspending money on students who aren't actually doing the work that they should. Employers might end up passing over a candidate with fewer credentials (but who would be a better fit) because they think that someone else has a better transcript. And so on...

But even in the context of the model you have presented, under both of the metrics you have proposed, the "fairest" map $\phi:[0,100]\rightarrow[0,100]$ is just the identity. By not curving at all, you are always guaranteed to be fair.

Now, you can argue that for the identity scale to be fair, the professor has to do their job correctly and adequately, and that the inability of universities to promise that professors are doing their jobs well is why we tolerate curves. But I think the solution should simply be to fire those people who can't teach, or at the very least not let them teach anything, rather than alter the metric by which we judge mastery of topics, particularly when the rest of society has to use that metric to decide who gets the contract to build that bridge (or any other "important" function that an individual might serve).

ANSWER

This is a cute little problem. I have several things to say about it. Before I do, let's introduce some notation.

Define $d_\phi=\sum_{i=1}^n|\phi(X_i)-X_i|$, and let $[a,b]$ denote the target class average. (You have set $[a,b]=[65,75]$, but the numerical values don't really matter as to the structure of the problem.) Without loss of generality, suppose $X_1\leq X_2\leq\cdots\leq X_n$.

(1) Notice that we don't really need to find a function on $[0,100]$. Rather, we just need a function from $S$ into $[0,100]$.

Obviously, if $\mathbb{E}(S)\in[a,b]$ then we let $\phi$ be the identity operator. The remaining cases are where $\mathbb{E}(S)<a$ or $\mathbb{E}(S)>b$. But...

(2) Note that under realistic circumstances we must always have $\phi(X_i)\geq X_i$. With this additional constraint, it may not be possible to find $\phi$ satisfying $\mathbb{E}[\phi(S)]\in[a,b]$. In particular, if $\mathbb{E}(S)>b$ and $\phi$ is anything but the identity map, then $\phi$ will only decrease fairness (i.e., increase $d_\phi$) while moving the class average further from the target range. The only case that remains is where $\mathbb{E}(S)<a$.

(3) If $\mathbb{E}(S)<a$ then we can minimize $d_\phi$ subject to the constraint $\mathbb{E}[\phi(S)]\in[a,b]$ by guaranteeing $$\sum_{i=1}^n\phi(X_i)=na.$$ Clearly, such a function $\phi$ exists, and is not unique.

(4) Note that ideally we would also wish to minimize the quantity $$\|\phi(S)-S\|_\infty=\max_i|\phi(X_i)-X_i|.$$ In fact, in real life I should think that this is a greater priority than minimizing $d_\phi$. However it turns out that there is a function $\phi$ which will minimize both. For instance, we could simply find $c\geq 0$ such that $\mathbb{E}(S+c)=a$, provided $X_n\leq 100-c$. Of course, this may not work in general since we might have $X_n>100-c$.

Fortunately, this is not a great difficulty. The function $\phi=\phi_c$ is now given by the following: $$\phi_c(X_i)=\left\{\begin{array}{ll}X_i+c&\text{ if }X_i<100-c,\\100&\text{ if }X_i\geq100-c,\end{array}\right.$$ where $c$ is chosen so that $\mathbb{E}[\phi_c(S)]=a$. There is a unique solution to this problem, and although it is annoying to compute in general, it's quite easy to compute for a given concrete set $S$.
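Since the mean of $\phi_c(S)$ is continuous and nondecreasing in $c$, the unique $c$ can be found by bisection. A Python sketch (all names are mine; it assumes the target mean $a$ is attainable, which holds here because shifting every grade to $100$ overshoots any $a \leq 100$):

```python
# Sketch: find the shift c in phi_c by bisection, assuming the
# target mean is a. phi_c adds c to each grade and caps at 100.

def clamped_mean(grades, c):
    """Mean of phi_c(S): each grade shifted by c, capped at 100."""
    return sum(min(x + c, 100) for x in grades) / len(grades)

def find_shift(grades, a):
    """Find c >= 0 with mean(phi_c(S)) = a, via bisection."""
    if clamped_mean(grades, 0) >= a:
        return 0.0                      # no upward scaling needed
    lo, hi = 0.0, 100.0                 # mean is nondecreasing in c
    for _ in range(100):
        mid = (lo + hi) / 2
        if clamped_mean(grades, mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

grades = [40, 50, 60, 95]
c = find_shift(grades, 70)   # c == 10: the 95 is capped at 100,
                             # and (50 + 60 + 70 + 100)/4 = 70
```

Note that once some grades saturate at $100$, the remaining students must absorb a larger share of the shift, which is why $c$ here (10) exceeds the naive $a - \mathbb{E}(S) = 8.75$.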

(5) Let's get back to real life. A grading scale is stipulated by the syllabus, which is a contract between instructor and students. And although it is technically permitted for an instructor to go Darth Vader and alter the deal at the last minute, it's almost always a very bad idea.

If you have any freedom for curving, you should look at the students rather than use a silly math formula. Ask yourself, "Judging from my impression of his work, is Joe Student ready to pass this course?" People like to pretend that grading is objective. It's not. You have to make judgment calls. Math can help you with that, but at the end of the day you have to make your best call.