In many universities, professors scale or "curve" grades at the end to ensure (among other things) that there is no grade inflation. I'm interested in studying "fair" ways of doing this from a mathematical standpoint.
Let $S = \{X_1, X_2, \ldots, X_k\}$, where $X_i \in [0,100]$, be the multiset of grades for a given class. A $\textit{scale}$ $S'$ of $S$ is some other multiset $S'=\{\phi(X_1), \phi(X_2), \ldots, \phi(X_k)\}$, where $\phi:[0,100] \to [0,100]$ is some function. We say a scale is fair if $\phi$ is monotone increasing. Given two fair scales $S'$ and $S''$ with respective scale functions $\phi, \psi$, we say $S'$ is fairer than $S''$ if $\sum_i |\phi(X_i) - X_i| \leq \sum_i |\psi(X_i) - X_i|$.
Let us suppose that the professor wants to scale the grades such that the mean grade is $70 \pm 5 \%$. Given the above definitions, which scale function $\phi$ should he choose to ensure the scale is as fair as possible? If there's not a simple function that always works, is there an algorithm or a strategy that might be helpful?
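For concreteness, here is a minimal sketch (only one candidate strategy, not from the question itself) of the simplest map that is optimal under this model: a constant shift clipped to $[0,100]$. Since $\sum_i |d_i| \geq |\sum_i d_i|$, with equality exactly when all the changes $d_i$ share one sign, any monotone map that moves every grade in the same direction and lands the mean on the nearer edge of the target band attains the minimum possible total disturbance; the clipped constant shift is one such map (the clip can merge top scores, so monotonicity is only weak there). The class data and the band $[65,75]$ below are hypothetical.

```python
import numpy as np

def curve(grades, lo=65.0, hi=75.0):
    """Move the mean into [lo, hi] via a constant shift, clipped to [0, 100].

    Every per-student change shares one sign, so the total disturbance
    sum_i |phi(X_i) - X_i| equals k * |band edge - mean|, the minimum possible.
    """
    g = np.asarray(grades, dtype=float)
    m = g.mean()
    if lo <= m <= hi:
        return g                   # already in the band: the identity is fairest
    target = lo if m < lo else hi  # aim for the nearer edge of the band
    a, b = -100.0, 100.0           # bracket the shift c; the mean is increasing in c
    for _ in range(60):            # bisect until clip(g + c).mean() hits the target
        c = (a + b) / 2
        if np.clip(g + c, 0.0, 100.0).mean() < target:
            a = c
        else:
            b = c
    return np.clip(g + (a + b) / 2, 0.0, 100.0)

grades = [32.0, 47.0, 55.0, 61.0, 78.0]   # hypothetical class, mean 54.6
print(curve(grades).round(2))             # [42.4 57.4 65.4 71.4 88.4], mean 65
```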
This is, of course, but one model. There are also issues of subjectivity associated with the word "fairness". Perhaps there's some notion of "fairness" that this model doesn't quite capture. If so, please mention it. My opinion is that the "fairest" way of scaling is one that preserves the original order and disturbs the original dataset as little as possible.
One other possible notion (which you may consider if you are interested, though it is not the one I've chosen to ask about) is to consider the double sum $$\sum_{i,k} \left||\phi(X_i) - X_i| - |\phi(X_k) - X_k|\right|$$ and to minimize it among all possible (fair/monotone) scale functions $\phi$. With my original model above, a scale is "fair" if it doesn't disturb the original dataset much. With this model, a scale may disturb the original dataset a lot, but it still might be quite fair so long as students' grades are all altered by a similar amount (for instance, a fixed scale of $20\%$).
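To make the contrast between the two notions concrete, here is a small sketch (hypothetical grades, chosen so the clip at $100$ never binds): a fixed $20$-point scale scores badly on the first objective yet is perfectly fair under the second.

```python
import numpy as np

def l1_disturbance(x, y):
    # original objective: total absolute change to the dataset
    return np.abs(y - x).sum()

def dispersion(x, y):
    # alternative double-sum objective: how unevenly students are altered
    d = np.abs(y - x)
    return np.abs(d[:, None] - d[None, :]).sum()

g = np.array([40.0, 55.0, 60.0, 72.0])
shifted = g + 20                       # a fixed 20-point scale
print(l1_disturbance(g, shifted))      # 80.0 -- large disturbance
print(dispersion(g, shifted))          # 0.0  -- everyone altered identically
```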
Feel free to discuss other mathematically rigorous notions of "fair" scaling which you believe are pertinent, or possibly cite relevant literature.
This answer proposes a different definition of a "fair scale" and "fair exam" from the one proposed in the question.
We can approach this question from an information-theoretic standpoint. From this perspective, grades should be informative about some underlying quality of students to solve certain problems. As such, there is a "true distribution" of various skills and abilities students have. Unfortunately, these qualities are most likely multidimensional, but we need to "compress" them into an ordinal scale. This entails making some strange judgments such as "making a typo in an equation is 0.3 times as bad as accidentally multiplying both sides by zero". But suppose we have obtained some acceptable scale expressed as integer scores from 0 to 100. I am suspicious of cardinal scales and therefore attach only ordinal meaning to these numbers (for now).
Importantly, if we could observe the results for the entire student pool, there would be no need to ever rescale. The need to rescale arises because we observe different exams (information structures about the hypothetical score of students on "the true scale") for different parts of the student pool and want to make the scores comparable between students. In particular, if I believe I have set two similarly difficult exams for two random samples of 1000 students, but in one case all students receive 0 points and in the other all students receive 100 points, then I should revise my belief about whether I truly set similarly difficult exams. If the sample size for each group is only 10 students, I won't update my belief as much and will be more reluctant to rescale the exam.
Now let's say for simplicity that we have observed the scores of two exams for the entire pool (or, for each exam, a sufficiently large sample) of students. Suppose that density estimates of the score distributions of the two exams are given by $s_1:[0,100]\rightarrow \mathbb{R}$ and $s_2:[0,100]\rightarrow \mathbb{R}$ with $\int s_i(x)\,dx=1$. We are now looking for transformations $t_1:[0,100]\rightarrow [0,100]$ and $t_2$ that make the exams comparable.
As posited in the question, there is a strong case for preserving the order of the scores, so $t_1$ and $t_2$ should be strictly increasing functions. However, I do not see why there should be a strong case for maintaining point differences or for minimizing an objective like the one given in the question. If we were unable to set an exam such that the distribution of scores equals our target distribution for a large student sample, then there is no good reason to attach any cardinal meaning to these scores. However, we want to make the scores a) comparable to each other and b) lose as little information in this process as possible. I therefore propose to impose the two conditions below (a numerical sketch follows them):
a) Comparability holds: the transformed scores of both exams follow a common target density $p^*$, i.e. $t_i$ pushes $s_i$ forward to $p^*$. Writing $S_i$ and $P^*$ for the CDFs of $s_i$ and $p^*$, this forces $t_i = (P^*)^{-1} \circ S_i$ for $i\in \{1,2\}$; and
b) Minimal information loss holds: $$p^* = \arg \min_p \sum_i D_{KL}(p\,\|\,s_i),$$ where $D_{KL}$ is the Kullback–Leibler divergence between the distributions.
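Here is a numerical sketch of both conditions, under two assumptions that are not in the text above: the densities are estimated with a Gaussian KDE on a common grid, and we use the fact that expanding $\sum_i D_{KL}(p\,\|\,s_i)$ and completing it to a single KL term shows its minimizer is the normalized geometric mean of the $s_i$. The transform $t_i = (P^*)^{-1}\circ S_i$ is implemented with empirical CDFs and interpolation; all score samples are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

grid = np.linspace(0.0, 100.0, 1001)
h = grid[1] - grid[0]

def target_density(densities):
    # argmin_p sum_i KL(p || s_i) is the normalized geometric mean of the s_i
    log_gm = np.mean([np.log(np.clip(s, 1e-12, None)) for s in densities], axis=0)
    p = np.exp(log_gm)
    return p / (p.sum() * h)

def rescale(scores, p_star):
    # t = (P*)^{-1} o S_hat: push each score through its exam's empirical
    # CDF, then through the target quantile function; monotone by construction
    u = (np.argsort(np.argsort(scores)) + 1) / (len(scores) + 1)
    cdf = np.cumsum(p_star) * h
    return np.interp(u, cdf / cdf[-1], grid)

rng = np.random.default_rng(0)
exam1 = rng.normal(45, 12, 1000).clip(0, 100)   # hypothetical "hard" exam
exam2 = rng.normal(70, 8, 1000).clip(0, 100)    # hypothetical "easy" exam
s1, s2 = gaussian_kde(exam1)(grid), gaussian_kde(exam2)(grid)

p_star = target_density([s1, s2])
adj1, adj2 = rescale(exam1, p_star), rescale(exam2, p_star)
print(round(adj1.mean(), 1), round(adj2.mean(), 1))  # nearly equal after rescaling
```

After the transform, both adjusted samples are (approximately) draws from $p^*$, so a given adjusted score carries the same rank information regardless of which exam it came from.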
In practice, we of course observe the different exams year by year, so one has to fix $p^*$ before the exams are taken. My simple rule of thumb is to aim, with both the raw and the adjusted scores, for the maximum-entropy (uniform) distribution of scores. Ideally, I would only want to report percentiles to the department office. Unfortunately, rules such as "students below some cutoff fail and need to retake" prevent me from doing this and require choosing a different $p^*$ instead. (This means that to minimize the KL divergence one has to adjust the exams to match the target distribution rather than the other way around.) Also, it is hard to explain to students why one uses a crazy wiggle of a function to rescale scores, so $t_i$ tends to be "smoothed out" a bit.
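In the maximum-entropy case, the machinery above collapses nicely: with $p^*$ uniform on $[0,100]$ we have $(P^*)^{-1}(u) = 100u$, so $t_i$ is just the empirical percentile map. A short hypothetical helper in the notation of the earlier sketch:

```python
import numpy as np

def percentile_scores(scores):
    # uniform target: (P*)^{-1}(u) = 100 u, so t_i reduces to reporting percentiles
    ranks = np.argsort(np.argsort(scores))
    return 100 * (ranks + 1) / (len(scores) + 1)
```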
tl;dr: My own idea of "fair grading and scaling" is that the exam should be designed to have maximum entropy of scores from an ex ante perspective. Ex post, once I learn that an exam was too hard or too easy, I look for an order-preserving map which yields the targeted distribution, provided many students have taken the exam. Once there are only a few students in the class... things become complicated.