Finding a cluster of size $k$ with minimum intra-distance

56 Views Asked by At

I've been looking for an algorithm for the following problem: given a set of $N$ points $x_1, \ldots, x_N$ in $\mathbb{R}^d$ and an integer $k \leq N$, find a set of $k$ points that minimizes its intra-distance, i.e., a set $S \subseteq [N]$ of size $k$ such that $\sum_{i, j \in S} d(x_i,x_j) $ is minimal among all sets of this size.

This looks closely related to clustering, with the difference that the goal is to obtain a single tightly-packed subset of points rather than a partition of the entire set of points. However, I wasn't able to find any algorithms well-suited for this task; all the variants I managed to find either try to find a partition with prescribed sizes (such as this) or try to find a single subset, but with unconstrained size.

While iterating over all $k$-subsets yields an $O(N^k)$ time algorithm, I imagine there is a more efficient way to go about this (algorithms without proven runtime bounds would be fine, too). Is a solution to this problem known?

2

There are 2 best solutions below

2
On

You can solve the problem via integer linear programming as follows. Let binary decision variable $y_i$ indicate whether $i \in S$. The problem is to minimize $$\sum_{i<j} d(x_i,x_j) y_i y_j \tag1$$ subject to $$\sum_i y_i = k \tag2$$ You can linearize the objective $(1)$ by introducing binary (or just nonnegative) variables $z_{i,j}$ and minimizing $$\sum_{i<j} d(x_i,x_j) z_{i,j}$$ subject to $(2)$ and $$z_{i,j} \ge y_i + y_j - 1 \quad\text{for all $i<j$} \tag3$$ Constraint $(3)$ enforces the logical implication $y_i \land y_j \implies z_{i,j}$.

0
On

Another strategy that is asymptotically $O(N^k)$ but will probably perform much faster in practice is $A^*$ search. One admissible heuristic is the number of points left to add times the cost of adding the greedily least expensive point. Even tighter heuristics probably also exist.