compare lines and recognize similar ones


how can I find similar patterns in a line if I got a "template-line"?

[Figure: a template curve (red) and a longer data curve (green) containing similar-looking bumps]

In this example, given the template (red), how can I find out that there are two occurrences of it in the green line? The lines don't match exactly and the min/max values can differ, but the shape is similar. Is there a way/algorithm to achieve this?

I do have the X and Y values of all these points, if that helps.


There are 2 best solutions below

---

If the width of the test function is not scaled, then a simple method involves calculating the cross-correlation. The cross-correlation $(f\star g)(t)$ of two functions $f$ and $g$ quantifies how similar the two functions are for a given offset $t$.

Cross-correlation can be calculated using a convolution. Take your test function $f(t)$ and convolve its time-reversal $h(t) := f(-t)$ with your data function $g(t)$; wherever the test function "occurs" you will get a peak. By applying a suitable threshold (which will heavily depend on the data) you can detect occurrences of $f$. I.e. occurrences are indicated by $$ (f\star g)(t) = (h * g)(t) > H$$ for a suitable $H$.
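A minimal numerical sketch of this idea in NumPy; the template shape, the offsets, the noise level, and the threshold are all made up for illustration:

```python
import numpy as np

# Hypothetical template f and data g, sampled on the same grid.
# The data contains two scaled copies of the template plus noise.
t = np.linspace(0, 1, 50)
f = np.sin(2 * np.pi * t) * np.hanning(50)      # template f(t)
g = np.zeros(400)
g[60:110] += 1.0 * f                             # first occurrence
g[250:300] += 0.6 * f                            # second, scaled occurrence
rng = np.random.default_rng(0)
g += 0.05 * rng.standard_normal(g.size)          # additive noise

# Cross-correlation (f star g) via convolution with the reversed template h(t) = f(-t).
h = f[::-1]
corr = np.convolve(g, h, mode="valid")           # corr[k] = sum_n g[k+n] f[n]

# Threshold H: here chosen relative to the peak; in practice it depends heavily on the data.
H = 0.5 * corr.max()
peaks = [k for k in range(1, corr.size - 1)
         if corr[k] > H and corr[k] >= corr[k - 1] and corr[k] >= corr[k + 1]]
print(peaks)  # offsets of detected occurrences
```

The detected offsets land near the positions where the scaled copies of the template were inserted (60 and 250 in this synthetic example).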

Here is a very simple example where I have assumed that $f$ and $g$ are sampled at the same intervals. [Figure: test function] [Figure: data] Convolving them together highlights where the two curves look similar. [Figure: convolution result]

Here is another example with a test function which is not an odd function. One can see that the method is relatively robust under the addition of significant noise.

As said above, this will not work well if the width of the test function is also being scaled. In that case I would look at wavelet transforms; they could be just what you're looking for.
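A crude, hedged stand-in for a full wavelet transform is to correlate the data against a small bank of rescaled copies of the template (this is a multi-scale variant of the correlation idea above, not the wavelet transform proper; the function name and parameters are illustrative):

```python
import numpy as np

def detect_multiscale(g, f, scales, H):
    """Correlate data g against rescaled copies of template f.

    Each scale stretches the template by resampling, normalizes it to
    unit energy, and cross-correlates it with the data. Returns
    (scale, offset, score) triples whose score exceeds threshold H.
    """
    hits = []
    n = f.size
    for s in scales:
        m = max(2, int(round(n * s)))
        # resample the template to the new width
        fs = np.interp(np.linspace(0, n - 1, m), np.arange(n), f)
        fs /= np.linalg.norm(fs)                 # unit energy at every scale
        corr = np.correlate(g, fs, mode="valid")
        k = int(np.argmax(corr))
        if corr[k] > H:
            hits.append((s, k, float(corr[k])))
    return hits
```

Because each rescaled template is energy-normalized, the scores at different scales are comparable, and the best-scoring hit indicates both where and at roughly what width the template occurs.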

---

Assuming it is known (from some side information) that there can be at most $M$ occurrences of $s(t)$, with $1<M<\infty$, the problem may be posed as an instance of a (multiple) hypothesis testing problem: given $y(t)$, $t \in [0, T]$, we must decide which of the $M$ hypotheses $\mathcal{H}_m$, $m \in \{1,\ldots,M\}$, has occurred, where \begin{equation} \mathcal{H}_m: y(t) = \sum_{i=1}^m a_i s(t-\tau_i) + n(t), \quad m \in \{1,\ldots,M\}, \end{equation} is the hypothesis that there are $m$ occurrences of $s(t)$. Here $\{\tau_i\}_{i=1}^m$ are the delays of the occurrences of $s(t)$ ($0 < \tau_i < \tau_j < T$ for $i<j$), $\{a_i\}_{i=1}^m$ are the corresponding gains (scalings), with $a_i \in \mathbb{R}$, and $n(t)$ is noise. Since we are only interested in which hypothesis occurred and not in the parameters $\{\tau_i\}_{i=1}^m$ and $\{a_i\}_{i=1}^m$, the latter are typically referred to as "nuisance parameters", as their presence complicates the problem.

In order to proceed, we need to make assumptions on the statistical properties of $n(t)$. A common and mathematically convenient choice is to assume that $n(t)$ is a sample of a zero-mean white Gaussian process of power $N_0/2$.

Assume for a moment that we know the nuisance parameters, i.e., we know that if hypothesis $\mathcal{H}_m$ has occurred, it can only have done so with the known parameter values $\{\tau_i\}_{i=1}^m$ and $\{a_i\}_{i=1}^m$. Note that if there were no noise (i.e., $N_0=0$), we could tell which hypothesis occurred by simple inspection of $y(t)$. However, the presence of noise introduces uncertainty (what may look like a certain hypothesis may be deceiving due to a "bad" noise realization). Given the Gaussian noise assumption, and skipping technical details, one commonly employed decision rule is the following:

The maximum likelihood decision rule $\hat{\mathcal{H}}_m$ (or simply $\hat{m}$) is \begin{equation} \hat{m}= \arg \max_{m \in \{1,\ldots,M\}}\ \mathcal{L}(y(t)|\mathcal{H}_m,\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m), \end{equation}

where $\mathcal{L}(y(t)|\mathcal{H}_m,\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m)$ is the likelihood functional of hypothesis $m$, given as \begin{equation} \mathcal{L}(y(t)|\mathcal{H}_m,\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m) \triangleq \exp \left\{\frac{2}{N_0} \sum_{i=1}^m a_i\int_0^{T}y(t)s(t-\tau_i)dt - \frac{1}{N_0} \sum_{i=1}^m\sum_{j=1}^m a_i a_j \int_0^{T}s(t-\tau_i)s(t-\tau_j)dt\right\} \end{equation}

Note that the second term of the exponential can be pre-computed, meaning that only the correlation of $y(t)$ with the shifted versions of $s(t)$ is required for the computation of the likelihood functional. The correlation is typically performed by discrete-time linear filtering and the overall complexity is relatively small.
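As a sketch of how this might look in discrete time, the exponent of the likelihood functional can be evaluated by replacing the integrals with sums; the function name, the sampling assumptions, and the integer-delay convention below are all illustrative:

```python
import numpy as np

def log_likelihood_exponent(y, s, taus, amps, N0, dt=1.0):
    """Discrete-time evaluation of the exponent of the likelihood functional.

    y    : observed samples of y(t)
    s    : samples of the template s(t)
    taus : integer sample delays of the m hypothesized occurrences
    amps : corresponding gains a_i
    N0   : noise power parameter
    Integrals are approximated by sums times the sampling step dt.
    """
    n = len(y)
    shifted = []
    for tau in taus:
        x = np.zeros(n)
        L = min(len(s), n - tau)
        x[tau:tau + L] = s[:L]              # s(t - tau_i), truncated to [0, T]
        shifted.append(x)

    # Data term: (2/N0) * sum_i a_i * integral of y(t) s(t - tau_i)
    data_term = (2.0 / N0) * sum(a * np.dot(y, x) * dt
                                 for a, x in zip(amps, shifted))

    # Energy term: (1/N0) * sum_ij a_i a_j * integral of s(t - tau_i) s(t - tau_j)
    # (this Gram matrix depends only on the template and can be pre-computed)
    G = np.array([[np.dot(xi, xj) * dt for xj in shifted] for xi in shifted])
    a = np.asarray(amps, dtype=float)
    energy_term = (1.0 / N0) * a @ G @ a
    return data_term - energy_term
```

With correct delays the data term dominates and the exponent is large; with badly mismatched delays the energy term is paid without any matching correlation gain.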

Now, in order to take into account the presence of nuisance parameters, we need to make (further) assumptions on their statistical properties. One approach (the Bayesian) is to assume a certain distribution (pdf) for the nuisance parameters and employ the same decision rule as above, this time using the parameter-averaged likelihood functional $\mathcal{E}(\mathcal{L}(y(t)|\mathcal{H}_m,\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m))$, where $\mathcal{E}(\cdot)$ denotes expectation over the pdf of $\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m$.

Another approach is to assume no statistical properties for the nuisance parameters. In that case, one commonly employed decision rule (generalized maximum likelihood) is to use the same decision rule as above, this time using the parameter-maximized likelihood functional, i.e., $\max(\mathcal{L}(y(t)|\mathcal{H}_m,\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m))$, where the maximization is over $\{\tau_i\}_{i=1}^m,\{a_i\}_{i=1}^m$. Note that, for the "winning" hypothesis $\hat{m}$, the maximizing nuisance parameter values can be considered as estimates of the delays and scalings for the $\hat{m}$ occurrences of $s(t)$.