FFT performance using NumPy, PyFFTW, and cuFFT
Problem statement
Suppose we want to calculate the fast Fourier transform (FFT) of a two-dimensional image, and we want to make the call in Python and receive the result in a NumPy array. The easy way to do this is to utilize NumPy’s FFT library. Now suppose that we need to calculate many FFTs and we care about performance. Is NumPy’s FFT algorithm the most efficient? NumPy doesn’t use FFTW, widely regarded as the fastest implementation. The PyFFTW library was written to address this omission. FFTs are also efficiently evaluated on GPUs, and the CUDA runtime library cuFFT can be used to calculate FFTs.
For the 2D image, we will use random data of size n × n
with 32 bit floating point precision
image = np.random.random(size=(n,n)).astype(np.float32)
We would like to compare the performance of three different FFT implementations at different image sizes n
.
Hardware
The relative performance of the CPU and GPU implementations will depend on the hardware being using. This is the hardware I’m using to produce the results in this post:
CPU | AMD Ryzen 2700X (8 core, 16 thread, 3.7 GHz) |
GPU | NVIDIA RTX 2070 Super (2560 CUDA cores, 1.6 Ghz) |
Measuring runtime performance
To measure the runtime performance of a function, we will define time_function
that calls the function many times, similar to IPython’s timeit
.
The time_function
is written such that it will not run longer than 0.1 seconds or the runtime of a single function call.
def time_function(func, runtime=.1):
"""Time a function by running it repeatedly for at least 'runtime' seconds"""
start = timer()
t = 0
count = 0
while t < runtime:
t0 = timer()
func()
tf = timer()
t += tf - t0
count += 1
return t/count
NumPy implementation
In NumPy, we can use np.fft.rfft2
to compute the real-valued 2D FFT of the image:
numpy_fft = partial(np.fft.rfft2, a=image)
numpy_time = time_function(numpy_fft)*1e3 # in ms
This measures the runtime in milliseconds. This can be repeated for different image sizes, and we will plot the runtime at the end.
PyFFTW implementation
In PyFFTW, we have to create pyFFTW.empty_aligned
arrays to store the image and the FFT output.
The FFT is performed by calling pyfftw.FFTW
on these arrays, and the number of threads can also be specified here (I choose 16 in my case):
### Prepare the image for PyFFTW
a = pyfftw.empty_aligned((n,n), dtype='float32')
b = pyfftw.empty_aligned((a.shape[0], a.shape[1]//2 + 1), dtype='complex64')
a[...] = image
fft_object = pyfftw.FFTW(a, b, threads=16) # specify number of CPU threads to use
pyfftw_time = time_function(fft_object)*1e3 # in ms
cuFFT implementation
Finally, we can compute the FFT on the GPU.
This can be done entirely with the CUDA runtime library and the cufft
library.
In C++, the we can write the function gpu_fft
to perform the FFT:
void gpu_fft(const int dataH, const int dataW, const int iterations) {
/**
* @param dataH image height
* @param dataW image width
* @param iterations number of FFT iterations
**/
// prepare the data
float *h_Data;
float *d_Data;
cuComplex *d_DataSpectrum;
h_Data = (float *)malloc(dataH*dataW * sizeof(float));
cudaMalloc((void **)&d_Data, dataH*dataW * sizeof(float));
cudaMalloc((void **)&d_DataSpectrum, dataH * (dataW / 2 + 1) * sizeof(cuComplex));
// generate random image
for (int i = 0; i < dataH * dataW; i++)
{
h_Data[i] = getRand();
}
cudaMemcpy(d_Data, h_Data, dataH*dataW * sizeof(float), cudaMemcpyHostToDevice);
// create cuFFT plan
cufftHandle fftPlanFwd;
cufftPlan2d(&fftPlanFwd, dataH, dataW, CUFFT_R2C);
// compute iterations of FFT
for (int i = 0; i < iterations; i++) {
cufftExecR2C(fftPlanFwd, (cufftReal *)d_Data, (cufftComplex *)d_DataSpectrum);
cudaDeviceSynchronize();
}
// free data
free(h_Data);
cufftDestroy(fftPlanFwd);
cudaFree(d_Data);
cudaFree(d_DataSpectrum);
}
Notice that here we pass in the width and height of the image, and generate the random image in C++.
That data is then transferred to the GPU.
The iterations
parameters specifies the number of times we perform the exact same FFT (to measure runtime).
Note that in doing so we are not copying the image from CPU (host) to GPU (device) at each iteration, so the performance measurement does not include the time to copy the image.
Depending on the application, the performance of cudaMemcpy
may be critical.
This C++ function can be easily exposed to Python using pybind11
with the following code:
PYBIND11_MODULE(fft, m) {
m.def("gpu_fft", gpu_fft, "dataH"_a, "dataW"_a, "iterations"_a)
}
The full code is available on GitHub.
Then we can measure the performance of this GPU implementation using the time_function
:
iterations = 10000
cuda_fft = partial(gpu_fft, n, n, iterations)
cuda_time = time_function(cuda_fft)*1e3/iterations # in ms
Here, I chose 10,000 iterations of the FFT, so that cudaMemcpy
only runs for every 10,000 iterations.
If you choose iterations=1
, the measured runtime would include memory allocation and deallocation, which may not be needed depending on your application.
Performance comparison
With all of the functions defined, we can time them at different image sizes n
:
The results on the left show the runtime for each function on a log-log scale.
We can see that for all but the smallest of image sizes, cuFFT > PyFFTW > NumPy
.
On the right is the speed increase of the cuFFT implementation relative to the NumPy and PyFFTW implementations.
For the largest images, cuFFT is an order of magnitude faster than PyFFTW and two orders of magnitude faster than NumPy.