Effectively extracting "real" data from a noisy dataset


Background & Motivation:

I have three lists of timestamps. They are not necessarily the same length, but each is sorted. Each list comes from one of three real-life instruments, which record a timestamp each time they observe an "event". We may assume that their clocks are synchronized.

The issue is that the instruments are so sensitive that they record timestamps very frequently (on the order of $10\,\mathrm{Hz}$), and only a small portion of these correspond to actual events. It is difficult to say precisely how many timestamps are real events (perhaps a handful per year).

Further, the instruments are in close proximity, so if one instrument "sees" an event, the other two almost certainly do as well. The clocks are precise enough to resolve the difference in observation time, even when that difference is minuscule.

The data is for one year, the same year for each of the three lists. We may assume that the "random" timestamps are uniformly distributed. There are, however, gaps in the data (e.g. maybe all three instruments were off for the months of March, April, May). The gaps will be the same for all three lists.
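
As a rough sanity check on the uniform-noise assumption, the sketch below estimates how many purely accidental triplets one would expect. It assumes three independent Poisson streams with the same rate and a coincidence window much shorter than the mean spacing between timestamps; the function name and the example values for the window and the live time are hypothetical, not taken from my setup. The live time should exclude the gaps.

```python
def expected_accidental_triples(rate_hz, window_s, live_time_s):
    """Expected number of chance triplets (one timestamp per instrument)
    with max - min <= window_s.

    Assumption: three independent Poisson streams, each with rate rate_hz,
    and rate_hz * window_s << 1. For a fixed timestamp in one stream, the
    other two must fall in a region of area 3 * window_s**2, which gives a
    chance-triplet rate of 3 * rate_hz**3 * window_s**2 per second.
    """
    return 3.0 * rate_hz**3 * window_s**2 * live_time_s


# Hypothetical numbers: 10 Hz noise, 1 microsecond window, 3e7 s of live time.
n_chance = expected_accidental_triples(10.0, 1e-6, 3.0e7)
```

If the expected chance count is well below the number of triplets actually found inside the window, most of those triplets are unlikely to be accidental.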


Goal:

Using only the timestamps, I want to find those which are "likely" to correspond to a "real" event, so that further analysis can be conducted. The "events" in question are light signals, so I can restrict my search to triplets of timestamps (one per instrument) whose spread in observation time is less than the light transit time between two observers.
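
A minimal sketch of that restriction, assuming the three lists are sorted and that `window` is the transit time expressed in the same units as the timestamps. The function name `coincident_triplets` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def coincident_triplets(a, b, c, window):
    """Return every (ta, tb, tc) with max - min <= window.

    a, b, c are sorted lists of timestamps, one per instrument.
    """
    out = []
    for ta in a:
        # Timestamps in b that lie within `window` of ta.
        b_lo = bisect_left(b, ta - window)
        b_hi = bisect_right(b, ta + window)
        for tb in b[b_lo:b_hi]:
            # tc must keep the overall spread of the triplet within `window`.
            lo = max(ta, tb) - window
            hi = min(ta, tb) + window
            c_lo = bisect_left(c, lo)
            c_hi = bisect_right(c, hi)
            for tc in c[c_lo:c_hi]:
                out.append((ta, tb, tc))
    return out


# Toy data (made up): two triplets fit inside a window of 3.
hits = coincident_triplets([0, 100, 200], [1, 150, 201], [2, 160, 199], 3)
```

Because the lists are sorted, each lookup is a binary search, so the cost is dominated by the number of candidates that actually fall inside the window rather than by the full list lengths.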

My first inclination was to build, from the three lists, a list of triplets $(A,B,C)$, one timestamp contributed by each list, such that $\max(A,B,C) - \min(A,B,C)$ was minimized. Unfortunately, this found very few triplets which a) satisfied the restriction mentioned above and b) corresponded to "real" events. As I said, there are very few events, but I expect there to be far more than this approach found.
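
For concreteness, here is one possible reading of that minimization: the standard three-pointer scan over sorted lists, which returns the single triplet with the smallest spread. This is only a sketch and not necessarily the exact procedure I used to build the list; the function name is hypothetical.

```python
def min_spread_triplet(a, b, c):
    """Return (spread, (ta, tb, tc)) minimizing max - min over all triplets.

    a, b, c are sorted lists; returns None if any of them is empty.
    """
    i = j = k = 0
    best = None
    while i < len(a) and j < len(b) and k < len(c):
        trip = (a[i], b[j], c[k])
        lo = min(trip)
        spread = max(trip) - lo
        if best is None or spread < best[0]:
            best = (spread, trip)
        # Only moving the smallest element forward can shrink the spread.
        if a[i] == lo:
            i += 1
        elif b[j] == lo:
            j += 1
        else:
            k += 1
    return best


# Toy data (made up): the tightest triplet is (10, 11, 12).
best = min_spread_triplet([1, 10, 20], [4, 11, 30], [7, 12, 40])
```

Note that a global minimization of this kind returns one triplet per scan, whereas a fixed window keeps every triplet that fits inside it.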

I then tried doing the above, but minimizing the $\chi^2$ error, which I defined for some triplet $A,B,C$ (one from each list/instrument) as $(A-B)^2 + (A-C)^2 + (B-C)^2$. This found even fewer "triplets", and no real events.
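
To make the two scores concrete, the sketch below computes both for a single triplet (function names are hypothetical). For a fixed spread $s$, the $\chi^2$ score defined above always lies between $1.5\,s^2$ and $2\,s^2$, depending on where the middle timestamp falls, so the two criteria rank triplets in broadly the same way.

```python
def spread(a, b, c):
    """Spread of a triplet: latest minus earliest timestamp."""
    return max(a, b, c) - min(a, b, c)


def chi2(a, b, c):
    """Sum of squared pairwise differences, as defined in the text."""
    return (a - b) ** 2 + (a - c) ** 2 + (b - c) ** 2


# Two made-up triplets with the same spread but different chi2 scores.
even = (0, 1, 2)      # middle timestamp halfway: chi2 = 1.5 * spread**2
lopsided = (0, 0, 2)  # middle timestamp at one end: chi2 = 2 * spread**2
scores = [(spread(*t), chi2(*t)) for t in (even, lopsided)]
```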


Problem:

What techniques can be used to extract "coincidences" (the "real events") from a set of more-or-less uniformly distributed data, where there is far more random than "real" data? Here, we assume that I have access only to the timestamps.