Selecting one number from each set with minimum variance

85 Views Asked by At

I have a dataset that I need to find several sets of "similar" looking events across many days, which leads to the following problem.

Suppose I have $N \sim 500$ sets, each containing $N_j \in [5, 20]$ or so numeric values. I'd like to split these into groups, by selecting exactly 1 number from each list, with the goal of having each collection of the $N$ selected numbers have low range or variance (the exact statistic isn't important, but more similar sets of values are preferred). As stated, the maximum number of such selections I can make is set by $\min N_j$, however, I'd really like to maximize the number of such groups I can make, so I'm okay with dropping some of the $N$ lists. So the problem is: given $N$ sets, select $\hat{N}$ sets. Then make $K$ groups of $\hat{N}$ numbers by sampling 1 value from each of the $\hat{N}$ sets without replacement. The simultaneous goals are to maximize $K$, maximize $\hat{N}$, and minimize $\sum_{k=1}^{K}$range (group k).

Given the multiple objectives, there isn't one best solution, but I'm looking for a way to enumerate the possibilities. For instance, if I demand that all groups have range $< R$, and that $K > 10$, what is the maximum achievable value of $\hat{N}$.

One approach I've toyed around with is simply binning the values in each set, and building a binary occupancy matrix $A_{i,j}$ which indicates whether set $j \in [1,N]$ has at least one number in bin $i$. Then the problem reduces to finding large rectangles of all $1$s in $A$. This is the same as finding maximal bicliques in a bipartite graph defined by adjacency matrix $A$. While this problem is NP-complete, I've found a reasonably fast algorithm that will give me a locally maximal solution quickly. This works, but binning the values isn't really what I want, because I'd like to be able to increase the acceptable range of values within each group, e.g. use a value from an adjacent bin as the others, to increase the total number of groups $K$.

I could also think about this as minimizing the sum of pairwise cost of placing value $V_i^j$ from set $j$ where $1 \leq i \leq N_j$ in the same group as $V_k^l$ from set %l$, which would be a similarly acceptable metric of within group "similarity", but perhaps is more amenable to graph theoretic approaches?

Hope this description is adequate, thanks in advance for any help/thoughts! First time posting here, apologies if I haven't tagged this correctly.