This question relates to an engineering application I am working on.
We are computing convolutions of large amounts of data with a few kernels of bounded support. The standard approach is to take the Fast Fourier Transform (FFT) of both the kernel and the data, multiply the two arrays, and inverse-FFT the result.
However, the amount of data is huge and it is not feasible to FFT the entire data set. So we compute in sliding windows: start with a window of size S+W+S, where S is the size of the kernel support; do FFT, multiply, and inverse FFT on those S+W+S samples; record the results for the smaller central window of size W; and slide W further along the data array. Since S is not negligible compared with W, we compute more results than we actually keep.
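To make the scheme concrete, here is a minimal numpy sketch of it (essentially overlap-save); the function name and details are my own, not production code:

```python
import numpy as np

def sliding_fft_convolve(data, kernel, W):
    # Sketch of the scheme described above: transform S+W+S samples per
    # block, keep the middle W outputs (the edges are corrupted by the
    # circular wraparound), and slide by W.
    S = len(kernel)
    N = W + 2 * S                          # transform length S+W+S
    K = np.fft.rfft(kernel, n=N)           # kernel spectrum, computed once
    out_len = len(data) + S - 1            # full linear-convolution length
    nblocks = -(-out_len // W)             # ceiling division
    padded = np.zeros(nblocks * W + 2 * S)
    padded[S:S + len(data)] = data
    out = np.empty(nblocks * W)
    for b in range(nblocks):
        block = padded[b * W : b * W + N]
        y = np.fft.irfft(np.fft.rfft(block) * K, n=N)
        out[b * W : (b + 1) * W] = y[S : S + W]  # keep only the valid middle
    return out[:out_len]
```

The result matches `np.convolve(data, kernel)`, but each block's FFT is only S+W+S long instead of covering the whole data set.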
My question: are there any known shortcuts to reduce the amount of computation? For example, is there such a thing as a "partial inverse FFT" that would produce results only for the smaller window of size W? Is there anything else we can do to avoid computing values that we know will be thrown away?
I am not sure this is practical in the general case. If you look at the structure of the (I)FFT, every input influences every output. If you only need some outputs and use the decimation-in-frequency form, you can skip some computations, but the improvement appears to be only log(k), where k is the fraction of outputs you want. In the limit, if you only want one output (e.g., the DC component), you'd reduce the FFT complexity from O(N log N) down to O(N), which makes sense, since an average can be computed in O(N). This can still be worth it if you only want a tiny fraction of the outputs.
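The limiting case is easy to see from the DFT definition: each output bin is a dot product of the input against one complex exponential, so evaluating a handful of selected bins directly costs O(N) per bin. A hypothetical helper (my naming):

```python
import numpy as np

def dft_bins(x, bins):
    # Evaluate only the requested DFT output bins directly from the
    # definition: O(N) per bin, versus O(N log N) for a full FFT.
    # Only worthwhile when you need a tiny fraction of the outputs.
    N = len(x)
    n = np.arange(N)
    return np.array([np.dot(x, np.exp(-2j * np.pi * k * n / N))
                     for k in bins])
```

Note that `dft_bins(x, [0])` is just the sum of the inputs, which is the O(N) average computation mentioned above (up to the 1/N normalization).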
In theory you could write a special (I)FFT to do this, but you probably won't outperform a widely used package like FFTW3, and it's so heavily tuned that I wouldn't attempt to modify it.
But there's one trick you might use. If your convolution kernel represents a low-pass (or band-pass) filter, you can use a small IFFT that covers only the part of the spectrum you want.
I do this in my software-defined radio. I take a relatively wide digital IF (IQ) signal, select one part of it, and downsample to a more reasonable output sample rate: I perform a large forward FFT, multiply the frequency-domain components I want by those of my filter, and use a small IFFT to produce the output. This automatically downsamples by the ratio of the two transform sizes.
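A toy numpy sketch of the bin-selection idea (my own simplification, with the filter multiplication omitted; the caveats below still apply):

```python
import numpy as np

def fft_select_downsample(x, nbins):
    # Forward FFT, keep only the `nbins` frequency bins of interest
    # (here: the first nbins, i.e. a band at baseband), then a small
    # IFFT.  The output rate is reduced by a factor of len(x) / nbins.
    N = len(x)
    X = np.fft.fft(x)
    band = X[:nbins]          # a real design would also multiply these
                              # bins by the filter's frequency response
    return np.fft.ifft(band) * (nbins / N)
```

For a signal whose spectrum already fits inside the selected bins, this reproduces the signal decimated by N/nbins exactly; for anything else you need the filtering step, for the reasons described next.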
It's not quite that simple. If I made a brickwall filter by simply running the IFFT on a subset of frequency bins, the equivalent impulse response would be very long, and I'd get wraparound aliasing in the time domain (remember, FFT-multiply-IFFT actually performs circular convolution). The impulse response of my filter must not exceed M samples, where N = L + M - 1, N is the length of the FFT, and L is the number of new samples consumed per FFT.
Since I'd still like to approximate a brickwall, I start with one, take an IFFT to the time domain (which gives me a sinc pulse), and multiply by a window function (Kaiser, since it has a nice tuning knob). Then I slide the impulse response to the front of the buffer to minimize latency and take an FFT to get the filter back in the frequency domain. If I've chosen a good Kaiser parameter, the coefficients outside the passband will still be nearly zero, while those inside the passband but close to the edge will roll off in a controlled way that bounds the length of the time-domain impulse response.
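The design procedure above can be sketched in numpy roughly like this (a minimal sketch under my own parameter choices, not the actual SDR code):

```python
import numpy as np

def design_filter(N, M, cutoff_bins, beta=8.0):
    # N: FFT size; M: maximum allowed impulse-response length
    # (so that N = L + M - 1 holds for L new samples per transform);
    # cutoff_bins: half-width of the ideal brickwall passband;
    # beta: the Kaiser window's tuning knob.
    H = np.zeros(N)
    H[:cutoff_bins] = 1.0                 # ideal brickwall, positive freqs
    H[N - cutoff_bins + 1:] = 1.0         # mirrored negative frequencies
    h = np.fft.ifft(H).real               # IFFT -> sinc-like pulse
    h = np.roll(h, M // 2)                # main lobe into the first M taps
    h[M:] = 0.0                           # truncate to M samples
    h[:M] *= np.kaiser(M, beta)           # taper to control sidelobes
    return np.fft.fft(h)                  # back to the frequency domain
```

The resulting frequency-domain coefficients are nearly 1 deep in the passband and nearly 0 well outside it, while the time-domain impulse response is confined to M samples, which is exactly the constraint the circular convolution imposes.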
I could probably save some time on the forward FFT by doing a bunch of small ones instead of one big one, feeding a tree of FFTs that essentially implements a series of decimation filters. In a sense, this builds the "partial FFT" you asked about, one that avoids the unneeded calculations. But I'd need to work out how long the impulse response of each stage's filter has to be; if they're short, a hand-tuned direct FIR implementation might be more efficient. And I've found it's really hard to beat FFTW3 these days, since it's so highly tuned.