I'm facing a hard (for me) combinatorial problem. I start from N sets of different sizes; imagine, for instance, 13 sets with 2, 5, 300, 6, 7, 500, 1000, 234, 313, 200, 100, 37, 48 elements each. Now I need to take these sets and distribute them into M groups, with the property that each group's total size is as close as possible to the average total size over the M groups. In words, I need to "equidistribute" the N sets in order to have the best "balanced" distribution of them into M groups. To clarify with the example above, take M to be 4, so I need to end up with 4 groups of roughly 3 sets each. The total of all the sizes is 2752, which averaged over 4 groups gives 688.
So I'd need to group them so that each group's sum is as near as possible to 688, whether below or above it.
Here 1000 greatly exceeds the average, so it makes a group on its own: {1000}. Then we can have {500, 100}, then {234, 313, 48}, and finally {37, 300, 200, 2, 5, 6, 7}.
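For concreteness, the group sums of this hand-made grouping can be checked with a few lines of Python (just summing the grouping above):

```python
# The hand-made grouping of the 13 set sizes into 4 groups.
groups = [
    [1000],
    [500, 100],
    [234, 313, 48],
    [37, 300, 200, 2, 5, 6, 7],
]

# Average group sum: 2752 / 4 = 688.0
average = sum(sum(g) for g in groups) / len(groups)

for g in groups:
    print(sum(g))  # prints 1000, 600, 595, 557
```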
Of course the above distribution is far from optimal. What I need is an algorithm that minimizes the difference between each group's sum and the average, as a kind of nesting.
It seems to me this could be related to the pigeonhole principle, as a guess, but I don't know how to apply it.
I know the problem is a bit tricky to explain; sorry for my English.
Thanks in advance for your help.
You can solve the problem via integer linear programming as follows. Let $a_s$ be the size of set $s$, and let $t=\frac{1}{M} \sum_{s=1}^N a_s$ be the target group sum. Let binary decision variable $x_{s,g}$ indicate whether set $s$ is assigned to group $g$. The problem is to minimize $$\sum_{g=1}^M \left|\sum_{s=1}^N a_s x_{s,g} - t\right| \tag1$$ subject to $$\sum_{g=1}^M x_{s,g} = 1 \quad\text{for $s\in\{1,\dots,N\}$} \tag2 $$ The objective $(1)$ minimizes the total absolute deviation from the target. Constraint $(2)$ assigns each set to exactly one group.
You can linearize the absolute value by introducing nonnegative variables $y_g^+$ and $y_g^-$ and linear constraints $$\sum_{s=1}^N a_s x_{s,g} - t = y_g^+ - y_g^- \quad \text{for $g\in\{1,\dots,M\}$} \tag3$$
The linearized problem is to minimize $$\sum_{g=1}^M (y_g^+ + y_g^-)$$ subject to $(2)$ and $(3)$.
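As an illustration, the linearized model can be handed to an off-the-shelf MILP solver. Here is a sketch using SciPy's `milp` (which wraps HiGHS), assuming SciPy ≥ 1.9 is available; the variable layout (the $x_{s,g}$ binaries first, then $y_g^+$, then $y_g^-$) is just one convenient choice:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

a = [2, 5, 300, 6, 7, 500, 1000, 234, 313, 200, 100, 37, 48]
N, M = len(a), 4
t = sum(a) / M  # target group sum: 688.0

# Variable layout: x[s*M + g] for the N*M binaries, then y+ (M), then y- (M).
n_x = N * M
n_vars = n_x + 2 * M

# Objective: minimize sum of y_g^+ + y_g^-.
c = np.zeros(n_vars)
c[n_x:] = 1.0

rows, lbs, ubs = [], [], []
# Constraint (2): each set assigned to exactly one group.
for s in range(N):
    row = np.zeros(n_vars)
    row[s * M:(s + 1) * M] = 1.0
    rows.append(row); lbs.append(1.0); ubs.append(1.0)
# Constraint (3): sum_s a_s x_{s,g} - y_g^+ + y_g^- = t.
for g in range(M):
    row = np.zeros(n_vars)
    for s in range(N):
        row[s * M + g] = a[s]
    row[n_x + g] = -1.0      # -y_g^+
    row[n_x + M + g] = 1.0   # +y_g^-
    rows.append(row); lbs.append(t); ubs.append(t)

integrality = np.zeros(n_vars)
integrality[:n_x] = 1  # x binary, y continuous
bounds = Bounds(np.zeros(n_vars),
                np.concatenate([np.ones(n_x), np.full(2 * M, np.inf)]))

res = milp(c, constraints=LinearConstraint(np.array(rows), lbs, ubs),
           integrality=integrality, bounds=bounds)

# Recover the group sums from the binary assignment.
sums = [sum(a[s] for s in range(N) if res.x[s * M + g] > 0.5)
        for g in range(M)]
print(round(res.fun), sorted(sums))  # objective 624 and the four group sums
```

Note that $624$ is provably optimal here: the positive and negative deviations from $t$ must cancel, so the objective is twice the total positive deviation, which is at least $2(1000-688)=624$ because the group containing the size-$1000$ set already deviates by $312$.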
For your sample data, an optimal solution is \begin{align} x_{7,1}&=1\\ x_{6,2}&=1\\ x_{3,3}&=x_{10,3}=x_{12,3}=x_{13,3}=1\\ x_{1,4}&=x_{2,4}=x_{4,4}=x_{5,4}=x_{8,4}=x_{9,4}=x_{11,4}=1, \end{align} with group sizes $1000,500,585,667$ and objective value $$|1000-688|+|500-688|+|585-688|+|667-688| = 624.$$
If you instead prefer to minimize the difference between the largest and smallest group sums, you can linearize similarly: introduce variables $z^{\max}$ and $z^{\min}$ with $z^{\max} \ge \sum_{s=1}^N a_s x_{s,g} \ge z^{\min}$ for every $g$, and minimize $z^{\max}-z^{\min}$. For your sample data this yields group sums $1000, 600, 576, 576$, for an optimal range of $1000-576=424$.
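A sketch of that min-range variant, again with SciPy's `milp` (same assumptions and variable layout as before, with two continuous variables $z^{\max}, z^{\min}$ appended after the binaries):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

a = [2, 5, 300, 6, 7, 500, 1000, 234, 313, 200, 100, 37, 48]
N, M = len(a), 4
n_x = N * M
n_vars = n_x + 2              # x binaries, then z_max, then z_min

c = np.zeros(n_vars)
c[n_x], c[n_x + 1] = 1.0, -1.0  # minimize z_max - z_min

rows, lbs, ubs = [], [], []
for s in range(N):              # each set assigned to exactly one group
    row = np.zeros(n_vars)
    row[s * M:(s + 1) * M] = 1.0
    rows.append(row); lbs.append(1.0); ubs.append(1.0)
for g in range(M):
    row = np.zeros(n_vars)      # group sum - z_max <= 0
    for s in range(N):
        row[s * M + g] = a[s]
    row[n_x] = -1.0
    rows.append(row); lbs.append(-np.inf); ubs.append(0.0)
    row2 = row.copy()           # group sum - z_min >= 0
    row2[n_x] = 0.0
    row2[n_x + 1] = -1.0
    rows.append(row2); lbs.append(0.0); ubs.append(np.inf)

integrality = np.zeros(n_vars)
integrality[:n_x] = 1           # x binary, z continuous
bounds = Bounds(np.zeros(n_vars),
                np.concatenate([np.ones(n_x), np.full(2, np.inf)]))

res = milp(c, constraints=LinearConstraint(np.array(rows), lbs, ubs),
           integrality=integrality, bounds=bounds)
print(round(res.fun))  # optimal range: 424
```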