FFT ELI5 - I mostly get it, but not quite - is my understanding of frames/windows off?


tl;dr: I've got two audio recordings of the same song, without timestamps, and I'd like to align them. I believe an FFT is the way to go, but while I've come a long way, it feels like I'm right on the edge of understanding enough to make it work, and I would greatly benefit from a no-formula explanation of the FFT. (My education never covered this area.) So I came here from StackOverflow seeking ELI5 help.

The journey:

  1. Get two recordings at the same sample rate.
  2. Transform them into waveforms (DoubleArray). This drops the metadata like "samples/second", but the FFT math doesn't care until later.
  3. Run an FFT on them using a simplified implementation.
  4. Get an Array<Frames<Bin<amplitude, frequency>>>, because the older implementation hid all the details (like frame width, number of bins, and... stuff?) and output words I was familiar with, like "amplitude" and "frequency".
  5. Try moving to a more robust FFT (Apache Commons).
  6. Get an output of 'real' and 'imaginary'.
  7. Make the totally incorrect assumption that those were the same things (amplitude and frequency). Surprise, they aren't! I expected something like SortedSet<ChunkOfTime<Bin<frequency, amplitude>>>; yikes, was I wrong.
  8. Apache's FFT returns an Array<Complex>, which means it's... er... just one frame's worth? So I should be chopping the song into one-second chunks and passing each chunk into the FFT, calling it multiple times? That seems strange; how does it get lower frequencies?
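To make step 8 concrete, here's how I picture the chopping-into-frames part. This is a sketch in Python rather than Kotlin (so it runs with nothing but the standard library), using a deliberately naive O(n²) DFT; the frame size of 8 is made up for illustration:

```python
import cmath
import math

def dft(frame):
    """Naive O(n^2) discrete Fourier transform; a real FFT library
    computes the same result, just much faster."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def frames_of(signal, frame_size):
    """Chop the signal into consecutive, non-overlapping frames."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, frame_size)]

# A 32-sample sine that completes 2 cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]

# Each frame gets its own transform, so the result is frames x bins,
# not one transform of the whole song.
spectra = [dft(f) for f in frames_of(signal, 8)]
print(len(spectra), len(spectra[0]))  # -> 4 8
```

If that's right, the "Frames" layer from my old library's output was exactly this outer loop, and Apache just leaves the looping to me.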

To the best of my understanding, the complex number is a way to convey the phase shift and the amplitude in one neat package (and you need the phase shift if you want to run the FFT in reverse), while the frequency is encoded in the index of the array.

Which works out to (pseudocode):

fftResult
    .take(fftResult.size / 2) // only the first half of the bins is unique for a real-valued input
    .forEachIndexed { idx, complex ->
        phases[idx] = atan2(complex.imaginary, complex.real) // not really needed for me, but good if you want to go backwards
        frequencies[idx] = idx * 44100.0 / fftResult.size    // bin index scaled by the input sample rate
        amplitudes[idx] = sqrt(complex.real * complex.real + complex.imaginary * complex.imaginary)
    }
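To check that conversion, here's the same math in Python on a made-up "FFT result" where I put all the energy in one bin; `cmath.polar` does the sqrt/atan2 pair in one call, so it should agree with the pseudocode above:

```python
import cmath
import math

sample_rate = 44100.0
# Hypothetical 8-bin FFT result, purely for illustration:
fft_result = [complex(0, 0)] * 8
fft_result[2] = complex(3.0, 3.0)  # pretend bin 2 carries all the energy

n = len(fft_result)
half = fft_result[: n // 2]  # only the first half is unique for real input

amplitudes = [cmath.polar(c)[0] for c in half]   # same as sqrt(re^2 + im^2)
phases = [cmath.polar(c)[1] for c in half]       # same as atan2(im, re)
frequencies = [idx * sample_rate / n for idx in range(len(half))]

peak = max(range(len(half)), key=lambda i: amplitudes[i])
print(frequencies[peak])  # -> 11025.0, i.e. 2 * 44100 / 8
```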

Is my step #8 at all close to the truth? And why would the FFT return an array as big as my input number of samples? That makes me think I've got the "window" or "frame" part wrong.
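To sanity-check the size question, here's what I'd expect if my understanding is right: an N-sample frame always comes back as N bins, and the bin spacing (so, the lowest nonzero frequency a frame can distinguish) is sampleRate / frameSize. A quick Python sketch with a naive DFT, where the sample rate of 8000 is arbitrary:

```python
import cmath
import math

def dft(frame):
    """Naive O(n^2) DFT: one output bin per input sample."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

sample_rate = 8000.0
for frame_size in (8, 16, 32):
    bins = dft([0.0] * frame_size)  # contents don't matter for the size
    # Longer frame -> same number of bins as samples, but finer spacing,
    # which would explain how bigger windows reach lower frequencies.
    print(frame_size, len(bins), sample_rate / frame_size)
```

That prints `8 8 1000.0`, `16 16 500.0`, `32 32 250.0`, which at least matches the "array as big as my input" behavior I'm seeing.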