I'm trying to select an optimal subset of some items. I've tried 2 optimal approaches (branch-and-bound and integer programming) but both proved impossible for the size of the problem. I'm wondering if anyone has any suggestions for a better way to approach the problem, perhaps a suitable approximate algorithm?
In more detail, I have an $M \times N$ matrix $C$ which indicates the compatibility between a number of items and a number of scenarios. I want to be able to select a subset of $m$ items (i.e. $m$ of rows from $C$) while maximizing the compatibility of every scenario with at least one selected item (i.e. the columnwise maximum in the subset).
To formalise it a bit, lets say I have a variable $x_i$ for each row, which indicates if it's part of the chosen subset. Then I'm trying to maximize the sum of maxs: $$ \text{arg}\max_x \sum_{j=1}^{N} \max_{i} x_{i}C_{i,j}$$ subject to$$x_i \in \{0,1\}$$ $$\sum_{i=1}^{M} x_{i}= m$$
This point is where I wrote a branch and bound system in python. I started at the root node with the full matrix $C$. Then I branched by removing every row in turn. The bounds for each branch were computed as the sum of columnwise maxima for the remaining matrix on each branch (without picking a subset).
Unfortunately, for reasonable sized matrices (i.e. $N=300$) the number of branches is enormous. So next I reformulated it as an integer programming problem in matlab.
I removed the max operation from the cost function by introducing slack variables $s_j$. $$\text{arg}\max_{x,y,s}\sum_{j=1}^{N} s_j$$
An inequality constraint is used to limit the slack variable for each row to the columnwise max of the chosen subset (indicated by $y_{i,j}$). If the row is not chosen, a large number $M$ (outside the range of $C$) is added to "disable" the constraint. $$s_j \leq x_{i}C_{i,j}+M(1-y_{i,j})$$ $$y_{i,j} \in \{0,1\}$$ $$\sum_{i=1}^{M} y_{i,j}= 1$$ $$y_{i,j} - x_{i} \leq 0$$ The last 3 constraints ensure that there is one columnwise max ($y_{i,j}$) activated per column, and that $y_{i,j}$ can only be activated if row $i$ is selected (i.e. $x_i$ is activated).
So I end up with 2 lots of integer constraints, 2 lots of equality constraints, a tonne of inequality constraints and a very simple cost function. For small problem sizes this system works, but again it does not scale up to several hundred items, despite all the fancy heuristics and optimizations that Matlab uses. At this point I've kind of given up hope that I can find an optimal solution, but I'm not sure what would be the best way to find an approximate solution.
There are many heuristic methods that you could try for discrete optimization (e.g. simulated annealing, particle swarm, ant colony,etc...). One nice thing about heuristics, is that you may express the problem in many different ways, which often can have a significant effect in performance. I have had a good experience with genetic algorithms, Matlab includes a pretty decent one.
If heuristics are not performing well (how can you measure this?) then deterministic methods are very likely going to have a harder time. That said, commercial solvers like Gurobi or CPLEX, which are free for academic research, are extremely more efficient than other software. CPLEX in particular has indicator constraints which can be quite useful for your problem.