Let $v_1,\ldots,v_N$ be linearly independent unit vectors in $\mathbb{R}^N$ and denote their scaled sum by $s_N = \frac{1}{N}\sum_{k=1}^N v_k.$ I would like to find a small subset of size $n$ among those vectors such that their scaled sum approximates $s_N$ well. In other words, find
$$ J^\star = \underset{J\in\mathscr{J}}{\operatorname{argmin}} \bigg\lVert s_N - \frac{1}{n}\sum_{j\in J} v_{j}\bigg\rVert$$
where $J$ runs over the set $\mathscr{J}$ of all subsets of $\{1,\ldots,N\}$ with size $n$ and $\lVert \cdot \rVert$ is the Euclidean norm.
The set of vectors can be considered an i.i.d. sample drawn uniformly from the sphere. And, of course, in my case $N$ and $n$ are far too large ($N$ will be of the order of 10'000 or 100'000, and $n$ maybe one or two orders of magnitude smaller) to just try all subsets. So I am looking for something more clever.
My approach so far
I tried
- Repeated random subsampling, i.e. drawing many, many subsets of size $n$ in an iid fashion, calculating the approximation for each instance and retaining the best.
- Greedy approach: start with a single vector and enlarge the set one vector at a time, each time adding the vector that gives the best approximation for the enlarged set.
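The two heuristics can be sketched as follows (a minimal implementation, assuming the vectors are the columns of a `(d, N)` NumPy array `V`; the names `random_subsampling` and `greedy` are illustrative):

```python
import numpy as np

def random_subsampling(V, n, trials=1000, rng=None):
    """Draw `trials` random size-n subsets, keep the best approximation.

    V is a (d, N) array whose columns are the unit vectors."""
    rng = np.random.default_rng(rng)
    d, N = V.shape
    s_N = V.mean(axis=1)
    best_err, best_J = np.inf, None
    for _ in range(trials):
        J = rng.choice(N, size=n, replace=False)
        err = np.linalg.norm(s_N - V[:, J].mean(axis=1))
        if err < best_err:
            best_err, best_J = err, J
    return best_J, best_err

def greedy(V, n):
    """Grow J one vector at a time, always adding the vector that
    minimizes the error of the enlarged set."""
    d, N = V.shape
    s_N = V.mean(axis=1)
    J, partial = [], np.zeros(d)
    for k in range(1, n + 1):
        candidates = [j for j in range(N) if j not in set(J)]
        # error of the enlarged set if candidate j were added at step k
        errs = [np.linalg.norm(s_N - (partial + V[:, j]) / k) for j in candidates]
        best = candidates[int(np.argmin(errs))]
        J.append(best)
        partial += V[:, best]
    return J, np.linalg.norm(s_N - partial / n)
```

Both run in time polynomial in $N$ and $n$; the greedy pass costs $O(nNd)$ when the running partial sum is reused as above.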
Questions
- Is this a known problem with a proper name?
- Is it hard (as in NP-hard for example) or are clever solutions known?
- Are there better heuristic approaches?
- Are there theoretic results/performance guarantees for the two heuristics I used?
Note: I edited the question to include scaling. Some of the answers/comments refer to the older version where vectors were not scaled.
As suggested by @BenGrossmann, you can use integer linear programming to minimize the 1-norm. Explicitly, let binary decision variable $x_j$ indicate whether $j \in J$. The problem is to minimize $\sum_{i=1}^N (z_i^+ + z_i^-)$ subject to linear constraints \begin{align} (s_N)_i - \frac{1}{n}\sum_{j=1}^N (v_j)_i x_j &= z_i^+ - z_i^- &&\text{for $i \in \{1,\dots,N\}$} \\ \sum_{j=1}^N x_j &= n \\ z_i^+ &\ge 0 &&\text{for $i \in \{1,\dots,N\}$}\\ z_i^- &\ge 0 &&\text{for $i \in \{1,\dots,N\}$} \end{align}
This might provide a good approximation for your 2-norm objective or a good starting solution for an improvement heuristic.
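A minimal sketch of this 1-norm MILP using `scipy.optimize.milp` (assuming SciPy ≥ 1.9; the function name and the `(d, N)` column layout of `V` are my choices, not from the answer):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def best_subset_1norm(V, n):
    """Solve the 1-norm subset-selection MILP.

    V is a (d, N) array of column vectors. Decision variables are
    ordered [x_1..x_N, z_1^+..z_d^+, z_1^-..z_d^-]."""
    d, N = V.shape
    s_N = V.mean(axis=1)
    c = np.concatenate([np.zeros(N), np.ones(2 * d)])  # min sum(z+ + z-)
    # residual split: (1/n) V x + z+ - z- = s_N  (one row per coordinate)
    A_res = np.hstack([V / n, np.eye(d), -np.eye(d)])
    # cardinality: sum_j x_j = n
    A_card = np.concatenate([np.ones(N), np.zeros(2 * d)])[None, :]
    A = np.vstack([A_res, A_card])
    b = np.concatenate([s_N, [n]])
    integrality = np.concatenate([np.ones(N), np.zeros(2 * d)])  # x binary
    bounds = Bounds(np.zeros(N + 2 * d),
                    np.concatenate([np.ones(N), np.full(2 * d, np.inf)]))
    res = milp(c, constraints=LinearConstraint(A, b, b),
               integrality=integrality, bounds=bounds)
    J = np.flatnonzero(res.x[:N] > 0.5)
    return J, res.fun
```

On instances small enough to enumerate, the returned objective matches the brute-force minimum of $\lVert s_N - \frac1n \sum_{j\in J} v_j \rVert_1$ over all size-$n$ subsets.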
For the 2-norm, the problem is to minimize $\sum_{i=1}^N \left((s_N)_i - \frac{1}{n}\sum_{j=1}^N (v_j)_i x_j\right)^2$ subject to linear constraint $$ \sum_{j=1}^N x_j = n \tag1 $$
Because $x_i$ is binary, we have $x_i^2 = x_i$. For $i < j$, you can linearize each product $x_i x_j$ as described here.
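(The linked linearization is presumably the standard one for a product of binary variables: introduce $y_{i,j}$ together with the constraints
$$ y_{i,j} \le x_i, \qquad y_{i,j} \le x_j, \qquad y_{i,j} \ge x_i + x_j - 1, \qquad y_{i,j} \ge 0, $$
which force $y_{i,j} = x_i x_j$ whenever $x_i, x_j \in \{0,1\}$.)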
You can also strengthen the formulation by multiplying both sides of the cardinality constraint $(1)$ by $x_i$, yielding: $$\sum_{j=1}^{i-1} x_j x_i + x_i^2 + \sum_{j=i+1}^N x_i x_j = n x_i$$ And then linearize this quadratic constraint by using the products from the objective linearization: $$\sum_{j=1}^{i-1} y_{j,i} + \sum_{j=i+1}^N y_{i,j} = (n - 1) x_i$$
Given a feasible solution, a simple improvement heuristic for either norm is to replace one vector in $J$ with one vector not in $J$ if it improves the objective value.
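A sketch of this swap heuristic for the 2-norm (first-improvement local search; the function name and `(d, N)` layout of `V` are my conventions):

```python
import numpy as np

def swap_improve(V, J):
    """Local search: replace one vector in J with one vector not in J
    whenever the swap lowers the 2-norm error; repeat to a local optimum."""
    d, N = V.shape
    J = list(J)
    n = len(J)
    s_N = V.mean(axis=1)

    def err(idx):
        return np.linalg.norm(s_N - V[:, idx].mean(axis=1))

    improved = True
    while improved:
        improved = False
        current = err(J)
        for pos in range(n):
            for j in range(N):
                if j in J:
                    continue
                trial = J.copy()
                trial[pos] = j
                if err(trial) < current - 1e-12:
                    J, current, improved = trial, err(trial), True
    return J, err(J)
```

Each accepted swap strictly decreases the error, so the loop terminates; the result is a local optimum with respect to single swaps, not necessarily the global minimum.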