 On our GitHub web page, we have made available a fully worked Matlab implementation of a radix-4 Decimation in Frequency FFT algorithm. In the code, we have also provided an overall operations count in terms of complex matrix multiplications and additions. It can be indeed shown that each radix-4 butterfly involves 3 complex multiplications and 8 complex additions. Since there are log4N = log2N / 2 stages and each stage involves N / 4 butterflies, so the operations count is complex mul...
More

# Understanding the radix-2 FFT recursive algorithm The recursive implementation of the radix-2 Decimation-In-Frequency algorithm can be understood using the following two figures. The first one refers to pushing the stack phase, while the second one illustrates the popping the stack phase.   In particular, the two figures illustrate the Matlab implementation that you may find on our GitHub website: Implementation I Implementation II The above recursive implementation is the counterpart of the iterative implemen...
More At the github page, we prove an implementation of the radix-2 Decimation-In-Frequency FFT in Matlab. The code is an iterative one and considers the scheme in the following figure: A recursive approach is also possible. The implementation calculates also the number of performed multiplications and additions and compares it with the theoretical calculations reported in “Number of operation counts for radix-2 FFTs”. The code is obviously much slower than the highly optimized FFTW explo...
More At the github page, we prove an implementation of the radix-2 Decimation-In-Time FFT in Matlab. The code is an iterative one and considers the scheme in the following figure:   A recursive approach is also possible. The implementation calculates also the number of performed multiplications and additions and compares it with the theoretical calculations reported in “Number of operation counts for radix-2 FFTs”. The code is obviously much slower than the highly optimized FFTW ...
More

# Number of operation counts for radix-2 FFTs The actual Floating Point Operations per Second (FLOPS) depend on the particular hardware and implementation and algorithms with higher number of Floating Point Operations (FLOPs) may correspond to lower FLOPS implementations, just because with such implementations you can more effectively exploit the hardware. To compute the number of floating point operations for a Decimation In Time (DIT) radix-2 approach, refer to the following figure: Let N be the length of the sequence to be transfor...
More

# Finite Impulse Response (FIR) Filter in CUDA implemented as a 1D convolution

In this post, we consider the problem of calculating the output of a FIR (Finite Impulse Response) filter by directly evaluating the 1D convolution in CUDA. In the case when the filter impulse response duration is long, one thing that can be done to evaluate the filtered input is performing the calculations directly in the conjugate domain using FFTs. On our GitHub website, a sample code using the cuFFT library is reported. It is a direct translation of the Matlab-based example reported at Low...
More

# Tricks and Tips: 1D batched FFTs of real arrays

As reported in the cuFFT User Guide - CUDA 6.5, batch sizes other than 1 for cufftPlan1d() have been deprecated. Use cufftPlanMany() for multiple batch execution. Below, a fully worked example using cufftPlanMany() instead of cufftPlan1d() is reported. As it can be seen, int rank = 1; // --- 1D FFTs int n[] = { DATASIZE }; // --- Size of the Fourier transform int istride = 1, ostride = 1; // --- Distance between two successive input/output elements int idist = DATASIZE, odist = (DATASIZE / 2 + ...
More

# 1D FFTs of columns and rows of a 3D matrix in CUDA

In this post we explore the possibility of using a single cufftPlanMany to perform 1D FFTs of the columns of a 3D matrix. Transformations performed according to cufftPlanMany, that is called like cufftPlanMany(&handle, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_C2C, batch); must obey the Advanced Data Layout . In particular, 1D FFTs are worked out according to the following layout input[b * idist + x * istride] where b addresses the b-th signal and istride is the distan...
More

# Asynhcronous executions of CUDA memory copies and cuFFT

Suppose we have to calculate the FFTs of an array of size N. In this post, we face the question whether is it possible to hide the latency by concurrently executing the cuFFT with the memory copy of the array from the host to the device. In other words, we want to copy a first part of the array from host to device, begin the calculations on this portion of the array, then concurrently copying a second part of the array and so on. This behavior can be "emulated" with a proper use of zero padding...
More

# CuFFT and streams – Kepler architecture

This is a worked example of cuFFT execution and memcopies using streams in CUDA on the Kepler architecture. #include <stdio.h> #include <cufft.h> #define NUM_STREAMS 3 /********************/ /* CUDA ERROR CHECK */ /********************/ #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); } inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true) { if (code != cudaSuccess) { fprintf(stderr,"GPUassert: %s %s %dn", cudaGetErrorString(code)...
More