Suppose there are $N$ items that come in $M$ groups. Let $c_i \in \{1, ..., M\}$ for $i=1, ..., N$ represent the group membership for item $i$. I would like to find a coarser partition of the items into $K$ new groups, where $K < M$, with two constraints:
- items in the same group must be assigned to the same partition, and
- the new group sizes should be as close to even as possible.
My initial thought is to formulate this as a non-linear integer program, where $y_{ij} = 1$ if item $i$ is assigned to partition $j$ and is zero otherwise. Then I would have a set of constraints:
- $\sum_{j=1}^K y_{ij} = 1$ for $i=1,..., N$ (each item should be assigned to exactly one partition)
- $y_{ij} = y_{\ell j}$ for all $j=1, ..., K$ if $c_i = c_\ell$ (items in same group must be assigned to same partition)
and then I could define $N_j = \sum_{i=1}^N y_{ij}$ and minimize
$$\sum_{j=1}^K \left(N_j - \frac NK \right)^2.$$
The particular objective doesn't actually matter much here. So long as each $N_j$ is close to $N/K$, I don't care whether the deviation is measured in an $\ell_2$ sense, an $\ell_1$ sense, or something else along those lines.
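Note that since all items in a group move together, the decision variables can be aggregated to the group level: it suffices to assign each group's size to one of the $K$ folds. For small $M$ this can even be solved exactly by enumeration. A minimal sketch (the group names and sizes below are made up for illustration):

```python
from itertools import product

def exact_balanced_folds(group_sizes, K):
    """Brute-force search over all K^M group-to-fold assignments,
    minimizing the sum of squared deviations of fold sizes from N/K.
    Only feasible for small M; illustrative, not scalable."""
    groups = list(group_sizes)
    N = sum(group_sizes.values())
    best_assign, best_cost = None, float("inf")
    for assign in product(range(K), repeat=len(groups)):
        folds = [0] * K
        for g, j in zip(groups, assign):
            folds[j] += group_sizes[g]
        cost = sum((n - N / K) ** 2 for n in folds)
        if cost < best_cost:
            best_cost, best_assign = cost, dict(zip(groups, assign))
    return best_assign, best_cost

# Example: 5 groups of sizes 7, 3, 5, 4, 6 split into K=2 folds.
# The best split is 13 vs. 12 items (cost 0.5).
assignment, cost = exact_balanced_folds(
    {"A": 7, "B": 3, "C": 5, "D": 4, "E": 6}, K=2
)
```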
My questions:
- Is there a better formulation of this problem with a particularly easy solution?
- What algorithms will solve this problem exactly? Are there ways to get fast greedy approximate solutions?
- I presume I'm going to need to leverage some existing optimization software to get my solution. Are there any standard choices here for a Python/Julia/R user? (Code samples much appreciated!)
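In Python, one standard choice that ships with SciPy is `scipy.optimize.milp` (backed by the HiGHS solver). The quadratic objective isn't directly expressible there, but a common linear proxy is to minimize the maximum fold size $T = \max_j N_j$. A sketch under that substitution, with made-up group sizes:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

sizes = np.array([7, 3, 5, 4, 6])  # hypothetical group sizes
M, K = len(sizes), 2
nvar = M * K + 1                   # binaries x_{gj} (index g*K + j), plus T

c = np.zeros(nvar)
c[-1] = 1.0                        # minimize T

# each group assigned to exactly one fold: sum_j x_{gj} = 1
A_assign = np.zeros((M, nvar))
for g in range(M):
    A_assign[g, g * K:(g + 1) * K] = 1.0

# fold loads bounded by T: sum_g sizes[g] * x_{gj} - T <= 0
A_load = np.zeros((K, nvar))
for j in range(K):
    A_load[j, j:M * K:K] = sizes
    A_load[j, -1] = -1.0

res = milp(
    c=c,
    constraints=[LinearConstraint(A_assign, 1, 1),
                 LinearConstraint(A_load, -np.inf, 0)],
    integrality=np.append(np.ones(M * K), 0),  # x binary, T continuous
    bounds=Bounds(0, np.append(np.ones(M * K), np.inf)),
)
fold_of_group = res.x[:M * K].reshape(M, K).argmax(axis=1)
```

Other reasonable choices are PuLP or Pyomo in Python, JuMP in Julia, and ompr in R; all express the same model.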
Some additional background: I'm essentially looking for a more computationally efficient approach to leave-group-out cross validation. The current standard is to leave out a single group at a time, so you fit $M$ models, where $M$ can be quite large. In practice, something like $K=5$ or $K=10$ folds is sufficient for statistical purposes, and cross validation will have the properties we want so long as every item in the same group goes into the same fold and the folds are about the same size. So fitting $M \gg 10$ models when there are many groups is often inefficient and unnecessary.
One approach is to think of the groups as jobs, with the duration of each job equal to the number of items in its group. Now schedule these jobs on $K$ identical machines so as to minimize the makespan, that is, minimize $\max_j N_j$. The LPT (Longest Processing Time first) heuristic is fast and yields a $(4/3 - 1/(3K))$-approximation; even arbitrary-order greedy list scheduling achieves $2 - 1/K$.
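LPT is a few lines of pure Python: sort the groups by decreasing size and repeatedly place the next group in the currently lightest fold, tracked with a min-heap. A sketch (group names and sizes are again hypothetical):

```python
import heapq

def lpt_assign(group_sizes, K):
    """Longest Processing Time first: sort groups by decreasing size
    and greedily place each one in the currently lightest fold."""
    heap = [(0, j) for j in range(K)]  # (current load, fold index)
    heapq.heapify(heap)
    assignment = {}
    for g in sorted(group_sizes, key=group_sizes.get, reverse=True):
        load, j = heapq.heappop(heap)
        assignment[g] = j
        heapq.heappush(heap, (load + group_sizes[g], j))
    return assignment

folds = lpt_assign({"A": 7, "B": 3, "C": 5, "D": 4, "E": 6}, K=2)
```

This runs in $O(M \log M + M \log K)$ time and in practice produces folds very close to balanced, even when the worst-case bound is loose.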