Robust matching of values to a uniform sequence

53 Views Asked by At

I have a sequence of $N$ real numbers $X_i$ that are deemed to be equally spaced. $N$ is small (say $10$ to $20$).

Two problems can arise:

  • the spacing isn't perfectly uniform, it can vary in a smooth way, possibly causing a shift of, say up to $2$ points compared to the ideal arrangement;
  • there can be a few (say up to $3$) missing or extra points.

Do you have suggestions on how to efficiently number the points to get a "best correspondence" ?

In the example figure, the given positions are compared to the ideal ones (in blue). There is a displaced point ($5$) and a missing one ($9$). The expected output of the algorithm is $(1,2,3,4,5,6,7,8,10,11)$. Don't be confused by the plot, it is a 1D problem, all points should be projected on the vertical axis.

enter image description here

A possible solution could be to fit a low order polynomial, say cubic, and try different hypothesis of missing or extra points (like when computing a string edit distance), but that will be time consuming. I am looking for a simpler solution.

Note that just associating the points to their nearest ideal neighbor can fail because of the shift (in the given example, $5$ is closer to the ideal $4$ than to the ideal $5$).

Update:

The $x_i$'s are sorted. Because of possible insertions/deletions, the desired number of correspondences is a given $M$ which can differ from $N$. When $N<M$, there will be associations to "none".

For convenience you can assume the values to have been normalized to the range $[0,M-1]$, so that the ideal sequence is just given by $i$.

I acknowledge that the problem is no perfectly specified, as there can be ambiguous cases: for instance, when the given points are uniformly spaced but one is missing, there are several equally good assignments.

2

There are 2 best solutions below

1
On

My attempt: Let $(x_i)$ be the sequence of points you have. I would first order this sequence (if it is not ordered yet). Now I would do something like:

xs // list of given x_i, ordered
ys // final list

for i in 0 ... len(xs):
    if i == 0:
         ys[0] = round(xs[i])
    else:  
         ys[i] = max( round(xs[i]), ys[i-1]+1 )
    endif
endfor

So the final sequence $(y_i)$ is defined recursively via

$$\begin{align} y_0 & = \operatorname{round}(x_0) \\ y_{n+1} &= \max\{ \operatorname{round}(x_{n+1}), y_n + 1 \}\end{align}$$

From this you might find a more memory-efficient or time-efficient algorithm. Because you have to go over the (ordered) sequence $x_i$ only once the algorithm above needs time linear to the length of $(x_i)$.

0
On

My own attempt:

In the ideal sequence, the distance between consecutive points is $1$. When the density increases or decreases locally, it changes to $1\pm\epsilon$. When a point is displaced, the distance becomes $1-\epsilon$ on a side and $1+\epsilon$ on the other. When one or more point is missing, the distance jumps to about $2,3\cdots$ When one or more extra points are inserted, at least one of the distances drops below $0.5$. So we are looking for evident outliers.

So the idea is to detect distances below $1-\alpha$ and above $1+\beta$, for two well chosen thresholds, and

  • discard one of the endpoints of the small intervals. To decide which endpoint, we can choose the endpoint that has the closest (other) neighbor;
  • insert dummy points in the large intervals; to decide then number of points to be inserted, we can round the distance to the next integer.

When all "abnormal" distances have been processed, we end up with $N$ points left. If $N=M$, we are fine, otherwise the sequence is declared "unmatchable".

As a variant, we can also process the abnormal intervals by decreasing order of abnormality, until we obtain $M$ points.

As another variant, we can make the method adaptive by estimating the local density, and relating the distances to the local estimate. A robust estimator could be the median of, say 5 or 7 distances in a sliding window or at a few points. The local estimator can be interpolated or approximated by fitting a low degree polynomial.