How to select the columns of a binary feature matrix to obtain minimal overlap


Suppose you have a (rather sparse) binary matrix with $M$ rows and $N$ columns.
The columns represent objects, the rows represent features of the objects.
A $1$ in the matrix indicates that the corresponding object has the corresponding feature.

You can apply a logical 'or' across each row of the matrix: any row containing at least one $1$ gives $1$, otherwise $0$.
Let vector $v$ of length $M$ be the result of the above operation applied to the matrix.
For the above defined 'initial' matrix, $v$ is a vector of $1$'s, i.e. each feature is found in at least one object.

The goal is to select $n \le \frac N 2$ columns of this matrix, such that if you calculate $v_1$ for the resulting submatrix (with only those selected columns), and $v_2$ for the submatrix of all the remaining columns, the scalar product $v_1 \cdot v_2$ (i.e. the 'overlap') is minimal.
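For concreteness, the overlap for a given selection can be computed directly. Here is a minimal NumPy sketch; the small matrix and the choice of $S_1$ (the first $n$ columns) are made up for illustration:

```python
import numpy as np

# Toy instance (made up for illustration): M = 4 features, N = 4 objects.
A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0]])

n = 2
sel = np.zeros(A.shape[1], dtype=bool)
sel[:n] = True                 # pick the first n columns as S1 (arbitrary choice)

v1 = A[:, sel].any(axis=1)     # feature present in at least one object of S1
v2 = A[:, ~sel].any(axis=1)    # feature present in at least one object of S2
# since v1, v2 are 0/1 vectors, v1 . v2 counts the features shared by both sides
overlap = int(np.count_nonzero(v1 & v2))
print(overlap)                 # -> 1 (only feature 0 appears on both sides)
```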

In other words, you want to partition the set of $N$ objects into a set $S_1$ of size $n$ and its complement $S_2$ of size $N-n$, such that the objects in $S_1$ have as few features as possible in common with the objects in $S_2$.

How would you approach this theoretically / computationally?

I found a way to write this as a linear programming problem.
It works, but for $N > 5000$ it becomes too slow to be of any practical use. I might need to do this for $N$ in the order of $10^6$.

As far as approximate methods go, I tried to make the initial matrix as 'close to diagonal' as possible by sorting the columns, after which I scanned all possible sets of $n$ consecutive columns to find the one giving the smallest overlap.
I am not talking about a proper diagonal matrix, which I believe can only be square.
What I mean is that I tried to rearrange the columns so that contiguous runs of columns have their $1$'s in roughly the same rows when you look vertically across the features.
It works approximately and is certainly faster than LP, but I am not sure this is the best approach.
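The scan over consecutive windows can be made incremental, so that each shift of the window only touches two columns. Below is a rough sketch of this heuristic in NumPy; the function name `best_window` and the choice of sorting key (the mean row index of each column's $1$'s) are my own assumptions, not a prescription:

```python
import numpy as np

def best_window(A, n):
    """Sort columns by the centroid (mean row index) of their 1's, then slide a
    window of n consecutive columns, keeping per-feature counts so each shift
    only touches the two columns entering and leaving the window.  This is a
    heuristic sketch of the 'near-diagonal' idea, not an exact method."""
    M, N = A.shape
    rows = np.arange(M)
    # centroid of the 1's in each column; all-zero columns get centroid 0
    cent = (A * rows[:, None]).sum(axis=0) / np.maximum(A.sum(axis=0), 1)
    order = np.argsort(cent)
    A = A[:, order]
    total = A.sum(axis=1)              # 1's per feature over all columns
    inside = A[:, :n].sum(axis=1)      # 1's per feature inside the window

    def overlap(ins):
        # feature counted iff it occurs both inside and outside the window
        return int(np.count_nonzero((ins > 0) & (total - ins > 0)))

    best, best_start = overlap(inside), 0
    for start in range(1, N - n + 1):
        inside = inside - A[:, start - 1] + A[:, start + n - 1]
        ov = overlap(inside)
        if ov < best:
            best, best_start = ov, start
    return best, order[best_start:best_start + n]

# Block-diagonal toy example: the two blocks share no features,
# so a window covering one block gives overlap 0.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
best, cols = best_window(A, 2)
print(best, sorted(int(c) for c in cols))   # -> 0 [0, 1]
```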

Any suggestions / ideas?

Thanks!


EDIT: the linear programming formulation I used previously, which seems to be superseded in computational efficiency by the one suggested by @RobPratt below.

\begin{align} \min \sum_{i=1}^M (y_{i,1}+y_{i,2}) \\ \sum_{j=1}^N x_j &= n \tag1 \\ \sum_{j=1}^N x_j a_{i,j} - y_{i,1} \sum_{j=1}^N a_{i,j} &\le 0 &&\text{for $i\in\{1,\dots,M\}$} \tag2 \\ \sum_{j=1}^N x_j a_{i,j} - y_{i,1} \sum_{j=1}^N a_{i,j} &\ge 1 - \sum_{j=1}^N a_{i,j} &&\text{for $i\in\{1,\dots,M\}$} \tag3 \\ \sum_{j=1}^N x_j a_{i,j} + y_{i,2} \sum_{j=1}^N a_{i,j} &\ge \sum_{j=1}^N a_{i,j} &&\text{for $i\in\{1,\dots,M\}$} \tag4 \\ \sum_{j=1}^N x_j a_{i,j} + y_{i,2} \sum_{j=1}^N a_{i,j} &\le 2 \sum_{j=1}^N a_{i,j} - 1 &&\text{for $i\in\{1,\dots,M\}$} \tag5 \end{align}

1 Answer (by RobPratt)

Here is an integer programming formulation, perhaps different from yours. Let $a_{i,j}$ be the entry in row $i$, column $j$ of your matrix. For $j\in\{1,\dots,N\}$, let binary decision variable $x_j$ indicate whether object $j$ appears in $S_1$. For $i\in\{1,\dots,M\}$ and $k\in\{1,2\}$, let binary decision variable $y_{i,k}$ indicate whether feature $i$ appears in $S_k$. The problem is to minimize the quadratic objective function $\sum_{i=1}^M y_{i,1} y_{i,2}$ subject to linear constraints \begin{align} \sum_{j=1}^N x_j &= n \tag1 \\ x_j &\le y_{i,1} &&\text{for $i$ and $j$ such that $a_{i,j}=1$} \tag2 \\ 1-x_j &\le y_{i,2} &&\text{for $i$ and $j$ such that $a_{i,j}=1$} \tag3 \end{align} You can relax $y_{i,k}\in\{0,1\}$ to $y_{i,k}\ge 0$, and doing so might improve the solve time. The resulting problem is a mixed integer quadratic program (MIQP).


You can also linearize as follows. Introduce binary decision variable $z_i$ to represent the product $y_{i,1} y_{i,2}$. The problem is to minimize the linear objective function $\sum_{i=1}^M z_i$ subject to linear constraints $(1)$, $(2)$, $(3)$, and \begin{align} y_{i,1} + y_{i,2} - 1 &\le z_i &&\text{for $i\in\{1,\dots,M\}$} \tag4 \end{align} Again, you can relax $z_i\in\{0,1\}$ to $z_i \ge 0$ if it helps. Now use a mixed integer linear programming (MILP) solver.
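As a concrete illustration, the linearized formulation could be set up with `scipy.optimize.milp` roughly as follows. The solver interface is my choice, and a dense constraint matrix is used only to keep the sketch short; at the scales mentioned in the question you would need a sparse matrix and a stronger solver.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def min_overlap_milp(A, n):
    """Linearized MILP from the answer: variables x (N, binary) and
    y1, y2, z (M each, relaxed to [0, 1]); minimize sum_i z_i subject
    to constraints (1)-(4).  Sketch only: dense rows for readability."""
    M, N = A.shape
    nv = N + 3 * M                        # variable order: x | y1 | y2 | z
    c = np.zeros(nv)
    c[N + 2 * M:] = 1.0                   # minimize sum z
    rows, lb, ub = [], [], []
    r = np.zeros(nv); r[:N] = 1           # (1) sum_j x_j = n
    rows.append(r); lb.append(n); ub.append(n)
    for i in range(M):
        for j in range(N):
            if A[i, j] == 1:
                # (2) x_j <= y_{i,1}
                r = np.zeros(nv); r[j] = 1; r[N + i] = -1
                rows.append(r); lb.append(-np.inf); ub.append(0)
                # (3) 1 - x_j <= y_{i,2}  =>  -x_j - y_{i,2} <= -1
                r = np.zeros(nv); r[j] = -1; r[N + M + i] = -1
                rows.append(r); lb.append(-np.inf); ub.append(-1)
        # (4) y_{i,1} + y_{i,2} - 1 <= z_i
        r = np.zeros(nv); r[N + i] = 1; r[N + M + i] = 1; r[N + 2 * M + i] = -1
        rows.append(r); lb.append(-np.inf); ub.append(1)
    cons = LinearConstraint(np.array(rows), lb, ub)
    integrality = np.zeros(nv); integrality[:N] = 1   # only x is integer
    res = milp(c, constraints=cons, integrality=integrality, bounds=Bounds(0, 1))
    sel = np.flatnonzero(np.round(res.x[:N]) == 1)
    return int(round(res.fun)), sel

# Block-diagonal toy example: picking one whole block gives overlap 0.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
ov, sel = min_overlap_milp(A, 2)
print(ov, sorted(int(j) for j in sel))    # -> 0 and one block's columns
```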


An idea for possible improvement is to aggregate $(2)$ and $(3)$ (separately) to reduce the number of constraints. This will weaken the LP relaxation but speed up the underlying LP solves, so it might yield an overall improvement in solve time. \begin{align} \sum_{j:a_{i,j}=1} x_j &\le \left(\sum_j a_{i,j}\right) y_{i,1} &&\text{for all $i$} \tag{2$'$} \\ \sum_{j:a_{i,j}=1} (1-x_j) &\le \left(\sum_j a_{i,j}\right) y_{i,2} &&\text{for all $i$} \tag{3$'$} \end{align} OK, this turns out to be equivalent to your $(2)$ and $(4)$. So maybe try your formulation without your $(3)$ and $(5)$.


To instead maximize $\sum_i z_i$, you can omit $y_{i,k}$ and replace $(2)$, $(3)$, and $(4)$ with: \begin{align} \sum_{j:a_{i,j}=1} x_j &\ge z_i &&\text{for all $i$} \tag{2$''$} \\ \sum_{j:a_{i,j}=1} (1-x_j) &\ge z_i &&\text{for all $i$} \tag{3$''$} \end{align}
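A sketch of this maximization variant in the same `scipy.optimize.milp` style (again my choice of interface; the function name is made up):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def max_overlap_milp(A, n):
    """Maximization variant from the answer: only x (binary) and z in [0, 1];
    maximize sum_i z_i subject to (1), (2''), (3'').  Dense sketch only."""
    M, N = A.shape
    nv = N + M                            # variable order: x | z
    c = np.zeros(nv)
    c[N:] = -1.0                          # milp minimizes, so negate sum z
    rows, lb, ub = [], [], []
    r = np.zeros(nv); r[:N] = 1           # (1) sum_j x_j = n
    rows.append(r); lb.append(n); ub.append(n)
    for i in range(M):
        supp = np.flatnonzero(A[i])       # columns j with a_{i,j} = 1
        # (2'') z_i <= sum_{j in supp} x_j      =>  z_i - sum x_j <= 0
        r = np.zeros(nv); r[supp] = -1; r[N + i] = 1
        rows.append(r); lb.append(-np.inf); ub.append(0)
        # (3'') z_i <= sum_{j in supp} (1-x_j)  =>  z_i + sum x_j <= |supp|
        r = np.zeros(nv); r[supp] = 1; r[N + i] = 1
        rows.append(r); lb.append(-np.inf); ub.append(len(supp))
    cons = LinearConstraint(np.array(rows), lb, ub)
    integrality = np.zeros(nv); integrality[:N] = 1   # only x is integer
    res = milp(c, constraints=cons, integrality=integrality, bounds=Bounds(0, 1))
    return int(round(-res.fun)), np.flatnonzero(np.round(res.x[:N]) == 1)

# Same block-diagonal example: taking one column from each block puts every
# feature on both sides, so the maximum overlap is all M = 4 features.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
ov_max, sel_max = max_overlap_milp(A, 2)
print(ov_max)                             # -> 4
```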