Partition $N$ items into $K$ equally sized parts while keeping the original groups intact


Suppose there are $N$ items that come in $M$ groups. Let $c_i \in \{1, ..., M\}$ for $i=1, ..., N$ represent the group membership for item $i$. I would like to find a coarser partition of the items into $K$ new groups, where $K < M$, with two constraints:

  1. items in the same group must be assigned to the same partition, and
  2. the new group sizes should be as close to even as possible.

My initial thought is to formulate this as a non-linear integer program, where $y_{ij} = 1$ if item $i$ is assigned to partition $j$ and is zero otherwise. Then I would have a set of constraints:

  1. $\sum_{j=1}^K y_{ij} = 1$ for $i=1,..., N$ (each item should be assigned to exactly one partition)
  2. $y_{ij} = y_{\ell j}$ for all $j=1, ..., K$ if $c_i = c_\ell$ (items in same group must be assigned to same partition)

and then I could define $N_j = \sum_{i=1}^N y_{ij}$ and minimize

$$\sum_{j=1}^K \left(N_j - \frac NK \right)^2.$$

The particular objective doesn't actually matter here, however. So long as $N_j$ is close-ish to $N/K$ for all $j$, I don't care if it's in an $\ell_2$ or $\ell_1$ sense or something else vaguely along those lines.
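For concreteness, here is a minimal sketch of the two imbalance measures mentioned above (the function names `imbalance_l2` and `imbalance_l1` are illustrative, not from any library):

```python
def imbalance_l2(sizes, n, k):
    """Sum of squared deviations of the partition sizes from the ideal n/k."""
    return sum((s - n / k) ** 2 for s in sizes)

def imbalance_l1(sizes, n, k):
    """Sum of absolute deviations of the partition sizes from the ideal n/k."""
    return sum(abs(s - n / k) for s in sizes)
```

Either works as an objective; as noted, anything that penalizes sizes far from $N/K$ is acceptable.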

My questions:

  1. Is there a better formulation of this problem with a particularly easy solution?
  2. What algorithms will solve this problem exactly? Are there ways to get fast greedy approximate solutions?
  3. I presume I'm going to need to leverage some existing optimization software to get my solution. Are there any standard choices here for a Python/Julia/R user? (Code samples much appreciated!)

Some additional background: I'm essentially looking for a more (computationally) efficient approach to leave-group-out cross-validation. The current standard is to leave a single group out at a time, so that you fit $M$ models, where $M$ can be quite large. In practice, something like $K=5$ or $K=10$ is sufficient for statistical purposes, and cross-validation will have the properties we want so long as everybody in the same group goes into the same fold and the folds are about the same size. So fitting $M \gg 10$ models when there are many groups is often inefficient and unnecessary.

There are 2 answers below.

Answer 1 (score 1):

One approach is to think of the groups as jobs, with the duration of each job equal to the number of items in its group. Now schedule these jobs on $K$ identical machines, minimizing the makespan, that is, minimize $\max_j N_j$. The LPT heuristic is fast and yields a $(4/3 - 1/(3K))$-approximation.
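A sketch of LPT for this setting, assuming the input is a list of group sizes (the function name `lpt_partition` is illustrative): sort the groups by decreasing size and repeatedly give the next group to the currently smallest partition.

```python
import heapq

def lpt_partition(group_sizes, k):
    """Longest Processing Time heuristic: process groups from largest to
    smallest, assigning each to the currently least-loaded partition.
    Returns a list mapping each group index to a partition in 0..k-1."""
    # Min-heap of (current load, partition index)
    heap = [(0, j) for j in range(k)]
    heapq.heapify(heap)
    assignment = [None] * len(group_sizes)
    for g in sorted(range(len(group_sizes)), key=lambda g: -group_sizes[g]):
        load, j = heapq.heappop(heap)
        assignment[g] = j
        heapq.heappush(heap, (load + group_sizes[g], j))
    return assignment
```

For example, `lpt_partition([5, 4, 3, 3, 3], 2)` yields partition loads of 8 and 10 (the optimum here is 9 and 9, illustrating that LPT is approximate).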

Answer 2 (score 6):

First question: In the IP model you do not need a binary variable for each combination of item and partition. Given your requirement that groups be kept together, you just need a binary variable for each combination of group and partition. Your $y_{ij}=y_{\ell j}$ constraints would let the solver's presolve phase shrink the model down to this size, but you might as well start with the smaller formulation. Also, rather than make the problem quadratic, I would probably minimize the difference between the smallest and largest partition sizes, which is linear. This does not necessarily produce a model that is "particularly easy" to solve, but depending on your problem dimensions and your IP solver (and your patience), it may be easy enough.

Second question: You can solve the problem exactly using the IP model and an IP solver. A fast heuristic that might do reasonably well is to start with $K$ empty partitions, sort the groups into descending size order, then assign each group to the currently smallest partition.
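For small numbers of groups you can also sanity-check the exact optimum by brute force over all $K^M$ group-to-partition assignments; this is only an illustrative sketch (the name `exact_min_spread` is made up), not a substitute for an IP solver at realistic sizes.

```python
from itertools import product

def exact_min_spread(group_sizes, k):
    """Enumerate all k**M assignments of groups to partitions and return
    (best spread, best assignment), where spread = max load - min load.
    Exponential in the number of groups; for illustration only."""
    best, best_assign = None, None
    for assign in product(range(k), repeat=len(group_sizes)):
        loads = [0] * k
        for g, j in enumerate(assign):
            loads[j] += group_sizes[g]
        spread = max(loads) - min(loads)
        if best is None or spread < best:
            best, best_assign = spread, assign
    return best, best_assign
```

On the small example `[5, 4, 3, 3, 3]` with $K=2$, the exact optimum is a spread of 0 (loads 9 and 9), which the greedy heuristic misses.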

Third question: I can't speak for Julia or Python (although I do know of some IP solvers for Python), but with R I would be inclined to use the OMPR package (a DSL for LP/IP) for writing the model. OMPR will in turn rely on ROI for solving the model, and both OMPR and ROI will require you to load a solver-specific plug-in (and, of course, to have the corresponding solver installed).

I hacked an R notebook using OMPR and ROI with their respective CPLEX plug-ins. On a random test problem with $N=5700$, $M=130$ and $K=10$, the heuristic I described typically got a partition size spread of 5 (sizes ranging from 567 to 572), and the IP model got ten partitions of 570 each (spread = 0). The heuristic took a (small) fraction of a second. Building the IP model and solving it with CPLEX took around nine seconds.

As always, your mileage will vary.

ADDENDUM: I suspected (correctly) that using round numbers for the problem dimensions might be making things nicer, so I tried $N=5723$, $M=137$ and $K=10$ (which ensures that no solution has all partition sizes identical). The IP solution managed a spread of 1 (some partitions had 572 items, some had 573), which is the best possible when $K$ does not divide $N$. The heuristic solution had a spread of 30 (partition sizes ranging from 552 to 582).

ADDENDUM 2: I added a pairwise interchange heuristic after what Rob is calling the LPT heuristic. In the example with $N=5723$ etc., the pairwise swap heuristic reduced the spread from 30 to 2, not quite optimal (optimal being 1) but a lot closer. Like LPT, the swapping heuristic took well under one second on this example.
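A minimal sketch of a pairwise interchange pass of this kind, taking a starting group-to-partition assignment (e.g. from LPT) and swapping pairs of groups whenever that reduces the spread; this is an assumed reconstruction, not the notebook's actual code.

```python
def pairwise_swap(group_sizes, assignment, k):
    """Local improvement: repeatedly swap the partitions of two groups
    whenever the swap strictly reduces max load - min load.
    Returns (assignment, final spread). Sketch; mutates assignment."""
    loads = [0] * k
    for g, j in enumerate(assignment):
        loads[j] += group_sizes[g]
    improved = True
    while improved:
        improved = False
        for a in range(len(group_sizes)):
            for b in range(a + 1, len(group_sizes)):
                ja, jb = assignment[a], assignment[b]
                if ja == jb:
                    continue
                # Effect of moving group a to jb and group b to ja
                delta = group_sizes[b] - group_sizes[a]
                new_loads = loads[:]
                new_loads[ja] += delta
                new_loads[jb] -= delta
                if max(new_loads) - min(new_loads) < max(loads) - min(loads):
                    assignment[a], assignment[b] = jb, ja
                    loads = new_loads
                    improved = True
    return assignment, max(loads) - min(loads)
```

On the small example above, starting from the LPT assignment with loads 8 and 10, a single swap reaches the optimal spread of 0.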