Choose lines that minimize the total squared distance from each point to its nearest line


Problem statement

Given $n$ points $\{x_i\}_{i=1}^n \in \mathbb{R}^p$, choose $m$ lines that minimize the sum of the squared distances between each point and the line closest to it. Assume $n \gg 2m$.

That is, find the lines that minimize

$$\xi = \sum_{j=1}^m \sum_{i \in \mathcal{D}_j} \left|r_{ij}\right|^2$$

where $r_{ij}$ is the distance between $x_i$ and the $j$th line, and $\mathcal{D}_j$ is the set of indices of the points for which the $j$th line is the nearest.
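For concreteness, the objective is easy to compute numerically. A short sketch, where each line is parametrized by a point $c_j$ on it and a unit direction $u_j$ (this parametrization is my own choice, not part of the question):

```python
import numpy as np

def objective(X, C, U):
    """X: (n, p) points; C: (m, p) points on each line;
    U: (m, p) unit direction vectors.
    Returns xi, the sum over points of the squared distance
    to the nearest line."""
    # Residual of point i w.r.t. line j: the component of (x_i - c_j)
    # orthogonal to u_j.  Squared distance = |x_i - c_j|^2 - (projection)^2.
    diffs = X[:, None, :] - C[None, :, :]                    # (n, m, p)
    along = np.einsum('nmp,mp->nm', diffs, U)                # proj. onto u_j
    r2 = np.einsum('nmp,nmp->nm', diffs, diffs) - along**2   # sq. distances
    return r2.min(axis=1).sum()  # each point picks its nearest line
```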

I think the search space has dimension $(2p-1)m$, because each of the $m$ lines can be described by a $p$-vector indicating the slope and a $(p-1)$-vector indicating the intercept with an arbitrary hyperplane.

  • Is there a clearer way to notate/formulate this problem?
  • Is the problem convex?
  • What is the dimension of the search space, and can it be reduced?
  • How would you solve this problem?

Ideas

Principal component analysis: In the $m=1$ case, the problem is total least squares regression. The optimal line passes through the centroid and is parallel to the eigenvector of the covariance matrix with the largest eigenvalue (the first "principal component").
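A minimal sketch of the $m=1$ case: the centroid plus the first principal component, computed here via an SVD of the centered data rather than an explicit covariance matrix.

```python
import numpy as np

def tls_line(X):
    """Return (centroid, unit direction) of the total-least-squares line."""
    c = X.mean(axis=0)
    # The first right-singular vector of the centered data is the
    # eigenvector of the covariance matrix with the largest eigenvalue.
    _, _, Vt = np.linalg.svd(X - c, full_matrices=False)
    return c, Vt[0]
```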

For $m > 1$, a natural guess would be to continue adding principal components, but this solution isn't optimal. Consider points arranged like these:

[Figure: points lying along two distinct lines]

The first principal component is a diagonal line and the second principal component is orthogonal to that line, and $\xi > 0$. But we can achieve $\xi = 0$ by picking lines through the collinear points.

While the PCA lines aren't necessarily optimal, they might work well as seeds for an optimization routine that then varies the slopes and locations.
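One such routine is a Lloyd-style alternation (my own sketch, with no optimality guarantee): assign each point to its nearest line, refit each line by total least squares on its assigned points, and repeat. It is seeded randomly here for brevity; the PCA lines could be substituted as seeds.

```python
import numpy as np

def fit_lines(X, m, iters=50, seed=0):
    """Alternate nearest-line assignment and per-cluster TLS refits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Seed: random data points as line anchors, random unit directions.
    C = X[rng.choice(n, m, replace=False)].astype(float)
    U = rng.normal(size=(m, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    for _ in range(iters):
        # Assignment step: squared distance of every point to every line.
        diffs = X[:, None, :] - C[None, :, :]
        along = np.einsum('nmp,mp->nm', diffs, U)
        r2 = np.einsum('nmp,nmp->nm', diffs, diffs) - along**2
        labels = r2.argmin(axis=1)
        # Refit step: TLS (centroid + first principal component) per cluster.
        for j in range(m):
            pts = X[labels == j]
            if len(pts) < 2:      # degenerate cluster: keep the old line
                continue
            C[j] = pts.mean(axis=0)
            _, _, Vt = np.linalg.svd(pts - C[j], full_matrices=False)
            U[j] = Vt[0]
    return C, U, labels
```

Like $k$-means, this only converges to a local optimum, so multiple restarts (or PCA seeding) are advisable.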

Require that all lines be parallel, assume optimal slope is known: Then it's just a matter of choosing multiple "centroids" through which the lines should pass, which reduces the search space to $(p-1)m$ variables. You could compute the projections of the points onto the space (whose dimension is $p-1$) orthogonal to the known slope, then use a clustering algorithm (e.g. $k$-means with $k=m$) to determine the centroids of the projections. Since optimal clustering is NP-hard in general (e.g. $k$-means, and the related facility location problem), this suggests that the general problem is also NP-hard.
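A sketch of this reduction, assuming the slope $u$ is known. The farthest-point initialization and plain Lloyd iterations are my own choices for self-containedness; any $k$-means implementation would do.

```python
import numpy as np

def parallel_line_offsets(X, u, m, iters=100):
    """Cluster the projections of X onto the subspace orthogonal to u;
    each cluster centroid is a point that one of the parallel lines
    should pass through."""
    u = u / np.linalg.norm(u)
    # Remove the component along u; projections live in a (p-1)-dim space.
    proj = X - np.outer(X @ u, u)
    # Deterministic farthest-point initialization.
    centers = [proj[0]]
    for _ in range(1, m):
        d2 = np.min([((proj - c)**2).sum(axis=1) for c in centers], axis=0)
        centers.append(proj[d2.argmax()])
    centers = np.array(centers, dtype=float)
    # Plain Lloyd's k-means on the projections.
    for _ in range(iters):
        d2 = ((proj[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = proj[labels == j].mean(axis=0)
    return centers, labels
```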

Require that all lines be parallel and evenly spaced, optimal slope unknown: Then there are even fewer variables: the slope, the placement of one line, and the spacing between it and the others. You could initialize the optimization process by picking the slope of the first principal component; then the residuals are 1-dimensional and the quantiles of their distribution can be used to pick evenly spaced offsets of the principal component line. Playing around, I think this problem is convex, but I haven't attempted to prove it.
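A sketch of that initialization for $p=2$: take the first principal component as the slope, then read offsets along its normal from quantiles of the 1-dimensional residuals. Note the quantile offsets are generally not exactly evenly spaced, so a subsequent fit of one offset and a spacing (and the slope) would still be needed.

```python
import numpy as np

def evenly_spaced_init(X, m):
    """Initial guess for m parallel lines in the plane: slope from the
    first principal component, offsets from residual quantiles."""
    c = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - c, full_matrices=False)
    u, v = Vt[0], Vt[1]        # line direction and its unit normal
    t = (X - c) @ v            # signed 1-D residuals along the normal
    # Place lines at quantiles so each line gets roughly n/m points.
    qs = (np.arange(m) + 0.5) / m
    offsets = np.quantile(t, qs)
    return c, u, v, offsets    # line j is {c + offsets[j]*v + s*u : s real}
```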