Detecting dynamic parts in a graph


I have a set of (x,y) points which can be connected to form a graph, and my goal is to detect the dynamic parts of this graph. By dynamic I mean ranges where the values are not stable but change by going up and down, forming something other than a straight y = C line.

I tried looking at the graph of the slopes and picking out the parts that have slope > EPSILON. But I noticed that this approach depends on the density of the points: if 2 sets of points depict the same graph and one is denser, then its slope values will be lower (the change between 2 consecutive points is no longer noticeable).

How can I detect such areas from the points without depending on the number of points used to build the graph?

Here is an example of the data I am processing:

[Image: example plot of the data]

I want to be able to detect the dynamic ranges in this graph without depending on the density of the points given to describe the same graph (the more points there are, the smaller the difference between the "y" values of 2 consecutive points becomes).

In this graph we can see that a static part prevails in the beginning and in the end, and in the middle there is a good dynamic range...


There are 2 best solutions below


To be sure I'm answering what you're asking:

Your $x_i$ and $y_i$ values might be, say, dates and temperatures on those days, but you only get to read the thermometer now and then, so that (calling Jan 1 = 1, Jan 2 = 2, Feb 1 = 32, etc.), you have data like

(1, 12)

(2, 11)

(15, 13)

(22, 14)

(29, 13)

(33, 18)

(39, 25)

and so on. You'd like to identify most of January ($x = 1$ through $x = 29$) as "constantly low temp" but February as a warming trend.

Assuming that you've ordered the points so that you have $(x_1, y_1), (x_2, y_2), \ldots$, where the $x_i$ are increasing (as I did above) and then drawn the graph, then you're looking at some point $$ (x_i, y_i) $$ and asking "is the graph increasing faster than $\epsilon$ here?" One decent way to remove the dependence on the spacing of $x$ is this. Let's make it concrete and say you're looking at $(x_4, y_4)$. You compute $y_5 - y_4$ and find it's larger than $\epsilon$, but then realize that this is true only because $x_5$ is much larger than $x_4$. The usual solution is to instead compute $$ d_4 = \frac{y_5 - y_4}{x_5 - x_4}, $$ which could be called the "forward difference estimate of the derivative." The "backward difference estimate" would look at the prior rather than the next point: $$ b_4 = \frac{y_4 - y_3}{x_4 - x_3}. $$ The average of the two also makes some sense, as does a "symmetric" version, where you ignore $x_4$ and $y_4$ and instead look at $$ s_4 = \frac{y_5 - y_3}{x_5 - x_3}. $$

I'd suggest looking at each of these in the context you're examining and see how things look.
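As a rough illustration, here is a minimal Python sketch of the three estimates applied to the sample temperature data above. The function name `difference_estimates` is made up for this example, and the code uses zero-based list indices, so the point called $(x_4, y_4)$ in the text is `xs[3], ys[3]`, i.e. $(22, 14)$.

```python
def difference_estimates(xs, ys):
    """Return (forward, backward, symmetric) slope estimates at interior points.

    xs must be strictly increasing. Entry j of each returned list refers to
    the input point with zero-based index j + 1, because the two endpoints
    are missing one neighbour and are skipped.
    """
    forward, backward, symmetric = [], [], []
    for i in range(1, len(xs) - 1):
        forward.append((ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]))
        backward.append((ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1]))
        symmetric.append((ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1]))
    return forward, backward, symmetric


# Day-of-year and temperature readings from the example above.
xs = [1, 2, 15, 22, 29, 33, 39]
ys = [12, 11, 13, 14, 13, 18, 25]

fwd, bwd, sym = difference_estimates(xs, ys)
for x, f, b, s in zip(xs[1:-1], fwd, bwd, sym):
    print(f"x={x:2d}  forward={f:+.3f}  backward={b:+.3f}  symmetric={s:+.3f}")
```

On this data the estimates in January stay well below 1 in absolute value, while at $x = 33$ all three exceed 1, which matches the "flat January, warming February" reading.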


Given Foad's response to my earlier answer, I'm going to try to re-state and answer his question here in a second response, rather than extended comments.

Problem: There's an unknown function $F : [0, 500] \to \mathbb R$; we are given samples $y_i = F(t_i)$ of $F$ at times $t_0 = 0, t_1 = b, t_2 = 2b, \ldots, t_i = i\cdot b, \ldots, t_n$, where $n$ is approximately $500/b$. We may assume that non-constancies of $F$ occur at a scale substantially larger than $b$. We'd like to identify points $t_i$ at which $F$ is near-constant, independent of the value $b$.


From the number of samples, you can get a decent estimate of $b$ (namely, $b \approx 500/n$). (It's possible, too, that in your context you're actually given the value of $b$.)

Then compute, for instance, $$ d_i = \frac{y_{i+1} - y_{i}}{b} $$ When $|d_i|$ is small (less than some constant $\epsilon$ that you choose), we can say $F$ is nearly constant; when $|d_i|$ is large, $F$ is varying. This is pretty crude, but it is, at least, more or less independent of the spacing, $b$.
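To see why dividing by $b$ helps, here is a small Python sketch. The test signal `ramp` is purely hypothetical (flat, then a linear rise between $t = 200$ and $t = 300$, then flat again); it stands in for the unknown $F$ and is not taken from the question's data.

```python
def ramp(t):
    """Hypothetical stand-in for F: flat, linear rise on [200, 300], flat."""
    if t < 200:
        return 0.0
    if t > 300:
        return 10.0
    return (t - 200) / 10.0


def sample(b, t_max=500.0):
    """Sample the signal at t_i = i * b for i = 0..n, with n ~ t_max / b."""
    n = int(round(t_max / b))
    return [ramp(i * b) for i in range(n + 1)]


def scaled_differences(ys, b):
    """Forward differences divided by the spacing b (the d_i above)."""
    return [(ys[i + 1] - ys[i]) / b for i in range(len(ys) - 1)]


results = {}
for b in (1.0, 0.5):
    ys = sample(b)
    max_raw = max(abs(ys[i + 1] - ys[i]) for i in range(len(ys) - 1))
    max_scaled = max(abs(d) for d in scaled_differences(ys, b))
    results[b] = (max_raw, max_scaled)
    print(f"b={b}: largest raw difference={max_raw:.4f}, largest |d_i|={max_scaled:.4f}")
```

Halving $b$ halves the largest raw difference, which is the density effect described in the question, while the largest $|d_i|$ stays at the slope of the ramp, so a single $\epsilon$ can be used for both samplings.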

Another possible choice is "Compute the variance $v_i$ of the numbers $y_{i-k}, \ldots, y_{i+k}$ for some small $k$" (the "window size" is then $2k+1$), but this has the disadvantage that if you double the number of samples (i.e., cut the value $b$ in half), you end up examining a different period of time.

Better by far is to pick a time-span $\Delta t$, and compute $k = \dfrac{\Delta t}{2b}$, and then look at the variance of the samples $y_{i-k}, \ldots, y_{i+k}$. If $b$ doubles, $k$ will be half as large, and you'll end up looking at fewer samples, but they'll correspond to (approximately) the same time interval as your previous ones. Even if you take this latter approach, the results for different $b$ values will not be identical: the sample variance for $20$ samples will be different from the sample variance for $10$...but my guess is that in your particular problem, this will not be significant.

To summarize:

First, and once and for all, pick a time-interval over which you expect variation to be significant. Let's say that's $\Delta t = 5\ \text{ms}$.

Estimate $b \approx 500 / n$ (or use $b$ if it's given to you).

Compute $k = \dfrac{\Delta t}{2b}$, rounded to an integer, so that the $2k+1$ samples in a window span a time of about $\Delta t$.

For each $i$ with $k \le i \le n - k$ (so that the whole window lies inside the data),

(i) Let $m_i = \dfrac{ \sum_{j = i-k}^{j = i+k} y_j} {2k+1}$.

(ii) Let $v_i = \dfrac{1}{2k+1} \sum_{j = i-k}^{j = i+k} (y_j - m_i)^2$ (the variance of the window; take its square root if you'd rather threshold a standard deviation).

(iii) If $v_i$ is larger than some threshold, report that $F$ is varying at $t_i$; else report that it's approximately constant.
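The windowed-variance procedure can be sketched in Python as below. This is only an illustration under stated assumptions: the window half-width is taken as $k = \Delta t / (2b)$ rounded to the nearest integer (and at least 1), $v_i$ is the plain population variance of the window, and the names `varying_flags` and `ramp`, the test signal, and the numbers $\Delta t = 20$ and threshold $0.05$ are all invented for the demonstration.

```python
def varying_flags(ys, b, delta_t, threshold):
    """Return a list of (t_i, v_i, is_varying) for every i with k <= i <= n - k.

    ys        -- samples y_i = F(i * b), i = 0..n
    b         -- spacing between consecutive sample times
    delta_t   -- time span the window should cover (assumed chosen by the user)
    threshold -- variance above which the signal is reported as varying
    """
    k = max(1, round(delta_t / (2 * b)))  # half-width in samples
    n = len(ys) - 1
    out = []
    for i in range(k, n - k + 1):
        window = ys[i - k : i + k + 1]            # 2k + 1 samples
        m = sum(window) / len(window)             # window mean m_i
        v = sum((y - m) ** 2 for y in window) / len(window)  # variance v_i
        out.append((i * b, v, v > threshold))
    return out


def ramp(t):
    """Hypothetical stand-in for F: flat, linear rise on [200, 300], flat."""
    if t < 200:
        return 0.0
    if t > 300:
        return 10.0
    return (t - 200) / 10.0


first_flagged = {}
for b in (1.0, 0.5):
    n = int(round(500 / b))
    ys = [ramp(i * b) for i in range(n + 1)]
    flagged = [t for t, v, varying in varying_flags(ys, b, 20.0, 0.05) if varying]
    first_flagged[b] = flagged[0]
    print(f"b={b}: varying from t={flagged[0]} to t={flagged[-1]}")
```

Because the window is fixed in time rather than in number of samples, the two samplings should report roughly the same varying range, differing only by an amount on the order of the spacing $b$.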