In this post, we consider the problem of computing the output of an FIR (Finite Impulse Response) filter by directly evaluating the 1D convolution in CUDA.
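For reference, the quantity being computed can be written out as follows. The symbols are introduced here for illustration only: x is the input of N samples, h is the impulse response of M taps, and x[n] is taken as zero outside 0 ≤ n ≤ N-1.

```latex
y[n] = \sum_{k=0}^{M-1} h[k]\, x[n-k], \qquad n = 0, 1, \dots, N+M-2
```

Evaluating this sum directly costs on the order of N·M multiply-adds, which is why the filter length matters when choosing an implementation.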
When the filter impulse response is long, the filtered output can be evaluated more efficiently in the conjugate (frequency) domain using FFTs.
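As a sketch of the FFT route, assume an input x of N samples and an impulse response h of M taps (notation introduced here, not taken from the original code). Both sequences are zero-padded to a common length P, transformed, multiplied point by point, and transformed back:

```latex
y = \mathrm{IFFT}\!\left( \mathrm{FFT}(\tilde{x}) \odot \mathrm{FFT}(\tilde{h}) \right),
\qquad P \ge N + M - 1
```

Here \tilde{x} and \tilde{h} denote the zero-padded sequences and \odot is the element-wise product. The condition on P ensures that the circular convolution computed by the FFTs coincides with the linear convolution. The cost is on the order of P log P, rather than N·M for direct evaluation.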
Conversely, when the convolution kernel is short, a hand-written CUDA kernel that caches the input samples in shared memory may be more convenient. An example is available on our GitHub page.
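To give a rough idea of what such a kernel can look like, here is a minimal sketch. It is not the code from the GitHub repository. The names `firShared`, `firFilter`, `BLOCK_SIZE` and `MAX_TAPS` are hypothetical, and the sketch assumes a causal filter, an output truncated to the input length N, and coefficients small enough to fit in constant memory. Each block loads its own samples plus the M-1 preceding "halo" samples into shared memory, so every input value is read from global memory only once per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // threads per block (assumed value)
#define MAX_TAPS   64    // upper bound on the filter length M (assumed value)

// Filter coefficients in constant memory: all threads read the same tap together.
__constant__ float d_h[MAX_TAPS];

// Causal FIR: y[n] = sum_{k=0}^{M-1} h[k] * x[n-k], with x[n] = 0 for n < 0.
// Only the first N output samples are computed.
__global__ void firShared(const float *x, float *y, int N, int M)
{
    // Tile = (M-1) halo samples preceding the block + the block's own samples.
    __shared__ float tile[BLOCK_SIZE + MAX_TAPS - 1];

    const int halo  = M - 1;
    const int start = blockIdx.x * blockDim.x - halo;   // global index of tile[0]

    // Cooperative load; samples outside [0, N) are treated as zero.
    for (int i = threadIdx.x; i < blockDim.x + halo; i += blockDim.x) {
        const int g = start + i;
        tile[i] = (g >= 0 && g < N) ? x[g] : 0.0f;
    }
    __syncthreads();

    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N) {
        float acc = 0.0f;
        for (int k = 0; k < M; ++k)
            acc += d_h[k] * tile[threadIdx.x + halo - k];   // tile entry holding x[n-k]
        y[n] = acc;
    }
}

// Host helper (hypothetical): filters N samples of x with M <= MAX_TAPS taps.
// Returns 0 on success, -1 on invalid arguments or on a CUDA error.
int firFilter(const float *x, const float *h, float *y, int N, int M)
{
    if (N <= 0 || M <= 0 || M > MAX_TAPS) return -1;

    float *d_x = nullptr, *d_y = nullptr;
    const size_t bytes = static_cast<size_t>(N) * sizeof(float);
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return -1; }

    cudaError_t err = cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpyToSymbol(d_h, h, M * sizeof(float));

    if (err == cudaSuccess) {
        const int blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
        firShared<<<blocks, BLOCK_SIZE>>>(d_x, d_y, N, M);
        err = cudaGetLastError();                       // launch errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel

    cudaFree(d_x);
    cudaFree(d_y);
    return err == cudaSuccess ? 0 : -1;
}
```

For instance, calling `firFilter` with x = {1, 2, 3, 4, 5} and h = {1, 0.5} should fill y with {1, 2.5, 4, 5.5, 7}. Keeping the taps in constant memory is a design choice that suits short filters. For longer impulse responses the shared-memory tile grows with M and the FFT approach becomes more attractive.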
More details can be found in the book by D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Second Edition.