As a forewarning, I don't have extensive mathematics knowledge. I'm a sophomore in CS, and I've taken up to Calc 2 (plus some other random knowledge from watching YouTube videos). I've done some preliminary research, but haven't found much aside from dense academic journals that I really don't understand.
I have two HUGE sets of time-based data that I need to correlate with a time delay. Specifically, I need to compute the lag that maximizes the correlation between the two data sets. I'll be doing this analysis in Python.
The data sets are very large (almost 4 million rows in one of them), so I need to run this as fast as possible. I know how to do this by brute force, slowly, but I don't really know how to do it fast. Not only do I need to optimize in a mathematical sense (maximize the correlation), I also need to optimize in a computational sense.
Since this is the mathematics exchange, I was hoping for help with the mathematical side of the optimization, or at least for a faster approach than what I outline below.
Through some research, I've found that a Fast Fourier Transform (FFT)-based correlation is faster than running numpy.correlate.
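For reference, this is what I mean (a quick sketch I put together; as far as I can tell, SciPy's `correlate` with `method='fft'` is the FFT-based version, while `method='direct'` is the same slow time-domain computation that `numpy.correlate` does):

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(0)
a = rng.standard_normal(10_000)
b = rng.standard_normal(10_000)

c_direct = correlate(a, b, mode="full", method="direct")  # time-domain, like numpy.correlate
c_fft = correlate(a, b, mode="full", method="fft")        # FFT-based, much faster for large N

print(np.allclose(c_direct, c_fft))  # should print True: same result up to floating-point error
```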
Right now, my brute force method is this (using 24-hour windows of data):
- Starting with no time delay between the two series, calculate the correlation of the sets with the FFT-based function (in later runs, start from the average of all shifts found so far)
- Shift one set by x minutes, recalculate the correlation
- If the correlation is higher, shift up again by x minutes. If not, shift down by 2x minutes instead (and then keep shifting down by x minutes, since the correlation at the 0-minute shift doesn't need to be recomputed). (Actually, I don't think this will work, so I'll probably have to check every time delay in the entire -2 to +2 hour range.)
- Repeat step 3 until a maximum is found within the range of -2 to +2 hours
- Save some information like the correlation, time delay, date range, etc. to a csv file
- Shift the 24-hour window of data forward by 30 minutes
Then repeat all of those steps for all of the data (2 years' worth). The resolution of the data is on the scale of ~15 seconds, so it's quite a lot of processing; a rough sketch of the whole loop is below.
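This is just a sketch, not my real code: it assumes the two series are already-aligned NumPy arrays sampled every 15 seconds, and `SHIFT_STEP` stands in for the "x minutes" above (arbitrarily set to 5 minutes here):

```python
import numpy as np

SAMPLE_SECONDS = 15                         # data resolution (~15 s)
WINDOW = 24 * 3600 // SAMPLE_SECONDS        # 24-hour window, in samples
STEP = 30 * 60 // SAMPLE_SECONDS            # slide the window forward 30 minutes at a time
MAX_SHIFT = 2 * 3600 // SAMPLE_SECONDS      # search shifts in the -2 h to +2 h range
SHIFT_STEP = 5 * 60 // SAMPLE_SECONDS       # placeholder for the "x minutes" shift step

def best_shift_bruteforce(a, b):
    """Try every candidate shift and keep the one with the highest Pearson correlation."""
    best_shift, best_r = 0, -np.inf
    for shift in range(-MAX_SHIFT, MAX_SHIFT + 1, SHIFT_STEP):
        if shift >= 0:
            x, y = a[shift:], b[:b.size - shift]
        else:
            x, y = a[:a.size + shift], b[-shift:]
        r = np.corrcoef(x, y)[0, 1]
        if r > best_r:
            best_shift, best_r = shift, r
    return best_shift, best_r

def sliding_windows(a, b):
    """Best shift for each 24-hour window, stepping forward 30 minutes at a time."""
    for start in range(0, a.size - WINDOW + 1, STEP):
        shift, r = best_shift_bruteforce(a[start:start + WINDOW], b[start:start + WINDOW])
        yield start, shift * SAMPLE_SECONDS, r   # (window start, delay in seconds, correlation)
```

This nested loop over shifts and windows is exactly the part I'd like to speed up.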
Fourier transform aside, you are misunderstanding what this type of correlation algorithm produces. I say this "type" because correlation from a probability-and-statistics perspective is something quite different. For the problem here, the correlation calculation returns as many points as you have data: the correlation array represents how well the two series correlate at shifts from $-N/2$ to $N/2$, with zero shift in the center.

You make the computation ONCE and then look for a maximum in the result. The position of the maximum tells you the shift at which the two series correlate best.

Computed in the time/spatial domain, this correlation is the brute-force method and is $O(N^2)$. With a million points that is $10^{12}$ operations, and you could consider just letting the school's computer crunch those numbers if you only have to run it once.
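In code, "compute it once and look for the maximum" might look something like the sketch below, using SciPy's `correlate` and `correlation_lags` (this assumes two already-aligned, uniformly sampled arrays and a reasonably recent SciPy; the sign convention of the lag is worth verifying against data where you know the answer):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

SAMPLE_SECONDS = 15                      # the ~15 s resolution mentioned in the question

rng = np.random.default_rng(0)
a = rng.standard_normal(5760)            # one 24-hour window at 15 s per sample
b = np.roll(a, 120) + 0.1 * rng.standard_normal(a.size)   # b trails a by 120 samples (30 min)

# Compute the full cross-correlation ONCE (O(N log N) via FFT) ...
c = correlate(a, b, mode="full", method="fft")
# ... along with the shift, in samples, that each output position corresponds to.
lags = correlation_lags(a.size, b.size, mode="full")

best = lags[np.argmax(c)]                # position of the maximum = best shift
# With correlate(a, b), the peak lands at a negative lag when b trails a,
# so best is -120 here, i.e. a delay of 120 * 15 s = 30 minutes.
print(best, best * SAMPLE_SECONDS)
```

If you choose to use the FFT method, a quick overview of the algorithm is this: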