Dropping the "Lowest Grade" Problem


Let $A=\{a_1, a_2, \dots, a_n\}$ be a set of non-negative real numbers and $B=\{b_1, b_2, \dots, b_n\}$ be a set of positive real numbers.

Let $s = \dfrac{ \sum_A a}{\sum_B b} = \dfrac{a_1+a_2+\dots+a_n}{b_1+b_2+\dots+b_n}$.

What I want to do is find $k$ so that

$s_1 = \dfrac{\sum_{A-\{a_k\}} a }{\sum_{B-\{b_k\}} b}$ is the largest.

And more generally, find

$s_p = \dfrac{\sum_{A-\{a_{k_1},a_{k_2},\dots,a_{k_p}\}} a }{\sum_{B-\{b_{k_1},b_{k_2},\dots,b_{k_p}\}} b}$ is the largest.

I'm not exactly sure what this problem is called, and a Google search didn't turn up what I'm looking for.

And of course, I can do this by exhaustion, but I feel that there should be a smarter way to figure it out...
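For reference, the exhaustive approach mentioned above can be sketched as follows (function and variable names are mine, not from the question); it checks all $\binom{n}{p}$ ways to drop $p$ pairs, so it is only practical for small $n$:

```python
from itertools import combinations

def best_drop_exhaustive(a, b, p):
    """Try every way of dropping p index pairs; return (best ratio, dropped indices)."""
    n = len(a)
    best_ratio, best_drop = -1.0, None
    for drop in combinations(range(n), p):
        keep = [i for i in range(n) if i not in drop]
        ratio = sum(a[i] for i in keep) / sum(b[i] for i in keep)
        if ratio > best_ratio:
            best_ratio, best_drop = ratio, drop
    return best_ratio, best_drop
```

For example, `best_drop_exhaustive([1, 51, 100], [2, 100, 100], 1)` drops index 1 (the $51/100$ grade) rather than the lowest-ratio grade, which foreshadows the counterexample in the first answer below.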

There are 3 best solutions below


The naive approach is to delete the grade with the smallest ratio $\frac{a_i}{b_i}$, but this fails. Let $A=(1,51,100)$ and $B=(2,100,100)$. The worst grade is $\frac 12$, but deleting it leaves $\frac{151}{200}$, while deleting the $\frac{51}{100}$ instead leaves $\frac{101}{102}$, which is much better. The simple approach is to first scale the tests to the same possible score; then the denominators are all equal, and dropping the lowest scaled score is easily seen to be optimal.
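A quick numeric check of this counterexample (a short sketch; the function name is mine):

```python
def ratio_after_drop(a, b, k):
    """Overall grade after dropping the score/points pair at index k."""
    num = sum(x for i, x in enumerate(a) if i != k)
    den = sum(x for i, x in enumerate(b) if i != k)
    return num / den

a, b = [1, 51, 100], [2, 100, 100]
print(ratio_after_drop(a, b, 0))  # drop the 1/2 grade -> 151/200 = 0.755
print(ratio_after_drop(a, b, 1))  # drop the 51/100   -> 101/102, about 0.990
```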


Here is a short exposition on the problem, including some pretty good examples of why greedy algorithms or dynamic programming algorithms probably won't work. The key insight that the authors have is that we can transform this from a "find the maximum" problem to a "find the zero of a function" problem, and that function is, fortunately, pretty easy to evaluate at arbitrary points. Here's the transformation:

We want a subset $S$ of assignments so that:

$$ \frac{\sum_{i \in S} m_i}{\sum_{i \in S} n_i} = q \tag 1$$

is maximized, where $m_i$ is the student's score on assignment $i$ and $n_i$ is the maximum possible score. Rearranging this equation, we get:

$$ \sum_{i \in S} \left(m_i - qn_i\right) = 0 \tag 2$$

The LHS is a linear function of $q$ for any given $S$; call it $f_S(q)$ and set about finding its zeros. Note that at the value $q_{opt}$ for which equation $(1)$ is maximized, any $S$ other than the optimal one gives a ratio less than $q_{opt}$, which translates to $f_S(q_{opt}) < 0$ in equation $(2)$. Let's create a new function:

$$F(q) = \max_{|S| = n-k} \sum_{i \in S} \left(m_i - qn_i\right)$$

$F$ is piecewise linear and continuous, and since every piece has negative slope ($-\sum_{i \in S} n_i$), it is strictly decreasing. This means it has exactly one zero. We know from the observation above that this zero is $q_{opt}$, and that it occurs on the "piece" corresponding to the optimal subset. The magic of $F$ is that it is easy to evaluate at any given $q$: just keep the $n-k$ largest $m_i - qn_i$ terms. So we can use any standard zero-finding method to find $q_{opt}$ and, as a by-product, $S$.

The paper I reference above suggests both the bisection algorithm and Newton's method. In my experiments, and according to the authors in theirs, Newton's method works well. However, it seems to me that $F$ may have up to (though generally far fewer than) $n\choose k$ "pieces", so in the worst case we might perform $n\choose k$ Newton iterations. Bisection is potentially worse still, since the $q_{opt}$ "piece" can presumably be made arbitrarily small and therefore arbitrarily hard to isolate. So this method is arguably no better than brute force in the worst case, though I do not have an example of what that worst case looks like.

TL;DR

The Newton's-method solution seems to be worst-case $O({n \choose k})$, but possibly has a better average case. Certainly, in the situations where I've used it, it is far faster than the naïve brute-force solution.

Here's Python code implementing the Newton's-method solution:

import heapq
import numpy as np

# One Newton step: evaluate F at q by keeping the (n - k_drop) largest
# (score - q * points_possible) terms; return the grade ratio on that subset.
def assignment_subset_zero(q, k_drop, scores, points_possible):
    f = scores - q*points_possible
    S = heapq.nlargest(f.shape[0] - k_drop, range(f.shape[0]), key=f.take)
    return np.sum(scores[S])/np.sum(points_possible[S]), S

# Find the subset of assignments that maximizes the weighted average grade.
# scores and points_possible should be equal-length numpy arrays.
def optimal_assignment_subset(k_drop, scores, points_possible):
    # Newton's method lands exactly on a zero of the piecewise-linear F,
    # so comparing floats for equality terminates correctly here.
    previous_q = -1
    this_q = 0
    while previous_q != this_q:
        previous_q = this_q
        this_q, S = assignment_subset_zero(previous_q, k_drop, scores, points_possible)
    return S

And some code to run the more interesting examples included in the paper:

def dales_quiz_test_set(c, b):
    return (
        np.array([20+c] + [20 + i - b for i in range(1,11)]),
        np.array([40 + i*2 for i in range(11)])
    )

def set_equal(s1, s2):
    return (len(s1) == len(s2)) and (len(s1 & s2) == len(s1))

assert set_equal(set(optimal_assignment_subset(5, *dales_quiz_test_set(4,1))), {0,10,9,8,7,6})
assert set_equal(set(optimal_assignment_subset(5, *dales_quiz_test_set(6,1))), {0,1,2,3,4,5})

You can solve the problem via mixed integer linear programming as follows. Let binary decision variable $x_j$ indicate whether you keep exam $j$. The problem is to maximize $$\frac{\sum_{j=1}^n a_j x_j}{\sum_{j=1}^n b_j x_j} \tag1$$ subject to $$\sum_{j=1}^n x_j = n - p \tag2$$

You can linearize the objective $(1)$ by introducing a nonnegative decision variable $z$ (representing the ratio) and nonnegative decision variables $y_j$, together with the following linear constraints: \begin{align} \sum_{j=1}^n a_j x_j &= \sum_{j=1}^n b_j y_j \tag3 \\ y_j &\le x_j &&\text{for all $j$} \tag4 \\ y_j &\le z &&\text{for all $j$} \tag5 \\ y_j &\ge z - (1-x_j) &&\text{for all $j$} \tag6 \end{align}

The idea is that constraints $(4)$-$(6)$ force $y_j = z\cdot x_j$ (constraint $(4)$ assumes $z \le 1$, i.e., no score exceeds its points possible), so constraint $(3)$ reads $\sum_j a_j x_j = z \sum_j b_j x_j$, making $z$ equal the ratio in $(1)$. Now maximize $z$ subject to $(2)$-$(6)$.
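As a sanity check on this formulation (my own sketch, not part of the answer): once the binary vector $x$ is fixed, constraints $(3)$-$(6)$ force $z$ to equal the kept ratio, so enumerating all feasible $x$ must reproduce the MILP's optimum. The snippet below verifies this on the counterexample data from the first answer, checking constraints $(4)$-$(6)$ along the way:

```python
from itertools import combinations
from fractions import Fraction

def milp_optimum_by_enumeration(a, b, p):
    """For each feasible x (keeping n - p exams), set y_j = z * x_j with z the
    kept ratio, check constraints (4)-(6), and return the maximal z."""
    n = len(a)
    best = Fraction(-1)
    for keep in combinations(range(n), n - p):
        z = Fraction(sum(a[i] for i in keep), sum(b[i] for i in keep))
        for j in range(n):
            x = 1 if j in keep else 0
            y = z * x
            # constraints (4), (5), (6); (4) relies on z <= 1
            assert y <= x and y <= z and y >= z - (1 - x)
        best = max(best, z)
    return best

print(milp_optimum_by_enumeration([1, 51, 100], [2, 100, 100], 1))  # prints 101/102
```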