The preface of this post is that, quoting the CUDA C Programming Guide, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.
Now, __launch_bounds__ and maxregcount limit register usage by two different mechanisms.
__launch_bounds__
nvcc decides the number of registers to be used by a __global__ function through balancing the performance and the generality of the kernel launch setup. Saying it diffe...

More
# GitHub

# CudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so to achieve the desired alignment. Please, refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of Nrow x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following:
cudaMallocPitch...

More
# Implementing a critical section in CUDA

Critical sections are sequences of operations that must be executed sequentially by the CUDA threads.
Suppose to construct a kernel which has the task of computing the number of thread blocks of a thread grid. One possible idea is to let each thread in each block with threadIdx.x == 0 increase a global counter. To prevent race conditions, all the increases must occur sequentially, so they must be incorporated in a critical section.
This is illustrated in the code on our GitHub web page . S...

More
# Tricks and Tips – Using omp_set_num_threads and omp_get_num_threads

When programming with OpenMP, it should be noticed that omp_get_num_threads() returns 1 in sequential sections of the code.
Accordingly, even if setting, by omp_set_num_threads(), an overall number of threads larger than 1, any call to omp_get_num_threads() will return 1, unless we are in a parallel section.
The example on our GitHub website tries to clarify this point.

More
# Radix-4 Decimation-In-Frequency Iterative FFT

On our GitHub web page, we have made available a fully worked Matlab implementation of a radix-4 Decimation in Frequency FFT algorithm.
In the code, we have also provided an overall operations count in terms of complex matrix multiplications and additions.
It can be indeed shown that each radix-4 butterfly involves 3 complex multiplications and 8 complex additions.
Since there are log4N = log2N / 2 stages and each stage involves N / 4 butterflies, so the operations count is
complex mul...

More
# Understanding the radix-2 FFT recursive algorithm

The recursive implementation of the radix-2 Decimation-In-Frequency algorithm can be understood using the following two figures.
The first one refers to pushing the stack phase, while the second one illustrates the popping the stack phase.
In particular, the two figures illustrate the Matlab implementation that you may find on our GitHub website:
Implementation I
Implementation II
The above recursive implementation is the counterpart of the iterative implemen...

More
# Radix-2 Decimation-In-Frequency Iterative FFT

At the github page, we prove an implementation of the radix-2 Decimation-In-Frequency FFT in Matlab. The code is an iterative one and considers the scheme in the following figure:
A recursive approach is also possible.
The implementation calculates also the number of performed multiplications and additions and compares it with the theoretical calculations reported in “Number of operation counts for radix-2 FFTs”.
The code is obviously much slower than the highly optimized FFTW explo...

More
# Radix-2 Decimation-In-Time Iterative FFT

At the github page, we prove an implementation of the radix-2 Decimation-In-Time FFT in Matlab. The code is an iterative one and considers the scheme in the following figure:
A recursive approach is also possible.
The implementation calculates also the number of performed multiplications and additions and compares it with the theoretical calculations reported in “Number of operation counts for radix-2 FFTs”.
The code is obviously much slower than the highly optimized FFTW ...

More
# Sorting 2 or 3 arrays by key with CUDA Thrust

We have compared two approaches to sort arrays by key, with the same key. One of those approaches uses thrust::zip_iterator and the other thrust::gather.
We have tested them in the case of sorting two arrays or three arrays. In all the two cases, the approach using thrust::gather has shown to be faster.
The full codes are available on our GitHub website:
2 Arrays solution
3 Arrays solution
In the following, some timing results (NVIDIA GTX 960 card):
Timing in the case of 2 arrays for...

More
# Customized Stream Compaction

Stream compaction consists of removing undesired elements in a collection depending on a predicate. For example, considering an array of integers and the predicate p(x)=x>5, the array A={6,3,2,11,4,5,3,7,5,77,94,0} is compacted to B={6,11,7,77,94}.
The general idea of stream compaction approaches is that a different computational thread be assigned to a different element of the array to be compacted. Each of such threads must decide to write its corresponding element to the output array de...

More