Remove half the points to minimize variance


I have a set of $n$ points in $\mathbb{R}^d$ and I'm trying to find the subset of $\frac{n}{2}$ points with the smallest variance.

It can be shown that there exists a point in the set such that taking the $\frac{n}{2}$ points closest to it yields a 2-approximation (i.e. a subset whose variance is at most twice the optimum). However, I'm not interested in an approximation method but in an optimal solution.
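For concreteness, the 2-approximation described above can be sketched like this (a plain-Python sketch; "variance" here means the sum of per-coordinate variances, and the function names are my own):

```python
def variance(points):
    """Sum of per-coordinate variances = mean squared distance to the centroid."""
    m, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / m for k in range(d)]
    return sum((p[k] - mean[k]) ** 2 for p in points for k in range(d)) / m

def two_approx(points):
    """For every point c in the set, take the n/2 points nearest to c;
    keep the candidate subset of smallest variance."""
    m = len(points) // 2
    candidates = (
        sorted(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c)))[:m]
        for c in points
    )
    return min(candidates, key=variance)
```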

The obvious way to get an optimal solution would be to check every possible subset of size $\frac{n}{2}$, but then the time complexity is exponential in $n$ and it won't be practical to run.
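The brute-force baseline, just to fix what "exponential" means here (names are illustrative):

```python
from itertools import combinations

def variance(points):
    """Sum of per-coordinate variances = mean squared distance to the centroid."""
    m, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / m for k in range(d)]
    return sum((p[k] - mean[k]) ** 2 for p in points for k in range(d)) / m

def brute_force(points):
    """Exact but exponential: try all C(n, n/2) subsets of size n/2."""
    return min(combinations(points, len(points) // 2), key=variance)
```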

So is there any way to solve this faster?


BEST ANSWER

Many comments said it sounds NP-hard, but it's actually not (at least not in $n$; maybe it's NP-hard in $d$). We can solve it with a construction very similar to a Voronoi diagram. Between every pair of points, pass the hyperplane through their midpoint, perpendicular to the segment joining them; each such hyperplane cuts $\mathbb{R}^d$ into two half-spaces, one closer to the first point and one closer to the second. The regions carved out by all these hyperplanes (it can be shown there are $O\left(n^d\right)$ of them) have the following property: no matter which point you choose inside a given region, the $n/2$ points closest to it (or any other fraction of the points, not necessarily 50%) are always the same $n/2$ points, since otherwise we would have to be on the other side of at least one of the hyperplanes.

So we can take an arbitrary point from each region, find the $n/2$ points closest to it, and compute their center of mass (mean) and the resulting cost. Note that the optimal subset consists of the $n/2$ points closest to its own mean: if a point outside the subset were closer to that mean than a point inside it, swapping them would decrease the variance. Since that optimal mean lies in one of the regions, one of the closest-$n/2$ sets we enumerate is exactly the optimal subset, so the optimal solution is among the centers of mass we computed.

There is a naive way of computing those regions that, for constant $d$, takes $O\left(n^2\right)$ per region (by intersecting all the half-spaces defining that region), which gives a total running time of:

$$O\left(n^d\cdot n^2+n^d\cdot n\right)=O\left(n^{d+2}\right)$$
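As a sanity check of the region argument, in $d=1$ the bisecting hyperplanes are just points, the regions are intervals, and the closest-$n/2$ sets become contiguous windows of the sorted points, so the whole algorithm collapses to a sliding-window scan (a sketch under that assumption):

```python
def best_half_1d(xs):
    """d = 1 special case: the closest-n/2 sets are contiguous windows of the
    sorted points, so scanning all windows realizes the region argument."""
    xs = sorted(xs)
    n, m = len(xs), len(xs) // 2

    def var(w):
        mean = sum(w) / len(w)
        return sum((x - mean) ** 2 for x in w) / len(w)

    return min((xs[i:i + m] for i in range(n - m + 1)), key=var)
```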

ANSWER

Variance in $\mathbb R^d$ is a sum of coordinate variances. For a sample of size $m$ and coordinate $x$: $$ m s_x^2 = \sum_i (x_i-\bar x)^2 = \sum_i x_i^2 - m\bar x^2 = \sum_i x_i^2 - m\left(\sum_i \frac{x_i}m\right)^2 = \sum_i x_i^2-\frac1m\sum_i x_i^2-\frac2m\sum_{i>j}x_ix_j=\\ \frac{m-1}{m}\sum_i x_i^2 -\frac2m\sum_{i>j}x_ix_j. $$
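The identity is easy to verify numerically (a small check; the helper name is my own):

```python
def both_sides(xs):
    """Return (m * s_x^2, the right-hand side of the identity above)."""
    m = len(xs)
    mean = sum(xs) / m
    lhs = sum((x - mean) ** 2 for x in xs)
    rhs = (m - 1) / m * sum(x * x for x in xs) \
        - 2 / m * sum(xs[i] * xs[j] for i in range(m) for j in range(i))
    return lhs, rhs
```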

Thus, if you construct a triangular matrix: $$ \begin{pmatrix} (m-1)x_1^2 & -2x_1x_2 & -2x_1x_3 & \cdots& -2x_1x_n \\ 0 & (m-1)x_2^2 & -2x_2x_3 & \cdots & -2x_2x_n \\ \vdots&\vdots&\vdots&\ddots& \vdots\\ 0&0&0&\cdots&(m-1)x_n^2 \end{pmatrix} $$ then the goal is to keep the $m$ rows/columns whose total sum is smallest. If $d>1$, each coordinate contributes its own such terms, so the diagonal entries become $(m-1)(x_i^2+y_i^2+z_i^2)$ and the off-diagonal entries $-2(x_ix_j+y_iy_j+z_iz_j)$. It takes $O(n^2d)$ to construct the matrix and $O(n^2)$ to remove the $n/2$ largest rows/columns. It might still be infeasible if $n=10^9$, but at least it's not exponential.
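A sketch of the removal step as described (the matrix entries are kept implicit; note this drops one index at a time greedily, and the answer does not prove that greedy removal order is optimal):

```python
def greedy_half(points):
    """Greedily drop the n/2 indices with the largest remaining contribution
    to the matrix sum.  Heuristic sketch of the described procedure."""
    n = len(points)
    m = n // 2                                    # target subset size
    sq = [sum(c * c for c in p) for p in points]  # diagonal terms / (m-1)
    dot = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]
    alive = set(range(n))
    while len(alive) > m:
        # contribution of index i to the total sum over the remaining indices
        worst = max(alive, key=lambda i: (m - 1) * sq[i]
                    - 2 * sum(dot[i][j] for j in alive if j != i))
        alive.remove(worst)
    return sorted(alive)
```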