Optimization of a Fast Fourier Transformation-based correlation function?

50 Views Asked by At

As a forward, I don't have a lot of extensive mathematics knowledge. I'm a sophomore in CS, and I've taken up to Calc 2 (with some other random knowledge from watching youtube vids). I've done some preliminary research, but haven't found much aside from dense academic journals that I really don't understand.

I have two HUGE sets of time-based data that I need to correlate based on a time-delay. Specifically, I need to compute the lag required to maximize correlation of the two data sets. I'll be doing this analysis in Python.

The two sets of data are so large (almost 4 million rows of data in one of the sets), so I need to run this as fast as possible. I know how to do this brute-forcily and slowly, but I don't really know how to do it fast. Not only do I need to optimize the function in a mathematical sense (maximize correlation), but I need to optimize the function in a computational sense.

Since this is the mathematics exchange, I was hoping for some help regarding the mathematical side of optimizing the function, or at least to find a faster way than what I outline below.

Through some research, I've found that doing a Fast Fourier Transformation-based correlation is faster than running numpy.correlate.

Right now, my brute force method is this (using 24-hour windows of data):

  1. Starting with no time-delay between functions, calculate the correlation of the sets with the FFT function (then in future runs make the first shift the average of all shifts before it)
  2. Shift one set by x minutes, recalculate the correlation
  3. If the correlation is higher, shift up again by x minutes. If not, shift down by x*2 hour (then shift down by x minutes since you don't need to calculate the correlation of the 0 minute shift again) (Actually, I don't think this will work, so I'll probably have to look through every time delay in the entire -2 to +2 hour range)
  4. Repeat step 3 for the range of until a maximum is found in the range of -2 to +2 hours
  5. Save some information like the correlation, time delay, date range, etc. to a csv file
  6. Shift the window of data up by 30 minutes

Then repeat all of those steps for all of the data entries (2 years worth of data). The resolution of the data is on the scale of ~15 seconds, so it's quite a lot of data processing.

1

There are 1 best solutions below

0
On

Fourier Transform aside, you are misunderstanding what this type of correlation algorithm produces. I say this “type” because correlation from a probability and statistics perspective is something quite different. But for the problem here the result of the correlation calculation will return as many points as you have data. The correlation (array) represents how well the two series of data correlate from -N/2 to N/2 shifts, with zero shift being in the center. You make the computation ONCE and then look for a maximum in the result. The position of the maximum tells you the shift at which the two series correlate with each other the best. This correlation computed in the time/spatial domain is the brute force method is order (N^2) . With a million points that is 10^12 computations and you could consider just letting the school’s computer crunch those numbers if you only have to run this once. If you choose to use the FFT method a quick overview of the algorithm is this:

  1. Compute the FFT of series one.
  2. Compute the FFT of series two.
  3. Multiply the complex coefficients of series one times the complex conjugate (flip the sign of the imaginary part of the coefficient) of series two to form a new series.
  4. Inverse FFT the new series and strip off the imaginary part of the result. The result is (should be) an array real values but the algorithm will still return complex numbers.
  5. FFT Shift the data. This method returns the result with zero shift at zero index with negative shift wrapped round to the right part of the array. FFT shift will put the zero shift in the center of the array and make it much easier to understand. Positive shift to the right of zero and negative shift to the left.