Limiting register usage in CUDA

The premise of this post is that, quoting the CUDA C Programming Guide, "the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance." Now, __launch_bounds__ and -maxrregcount limit register usage by two different mechanisms. With __launch_bounds__, nvcc decides the number of registers to be used by a __global__ function by balancing the performance and the generality of the kernel launch setup. Saying it diffe...
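As a minimal sketch of the two mechanisms (the kernel body is a placeholder, not the post's code): __launch_bounds__ is attached to the kernel definition, while -maxrregcount is passed to nvcc and applies to the whole compilation unit.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the
// compiler the launch configuration it must support, so it caps register usage
// for this kernel only. The alternative, nvcc -maxrregcount=32, caps the
// register count for every kernel in the file.
__global__ void __launch_bounds__(256, 4)
addKernel(float *d_out, const float *d_in, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) d_out[i] = d_in[i] + 1.0f;
}
```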

CudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the cudaMallocPitch function to "pad" the rows of a 2D matrix with extra bytes so as to achieve the desired alignment. Please refer to the "CUDA C Programming Guide", Sections 3.2.2 and 5.3.2, for more information. Assuming that we want to allocate a 2D padded array of Nrows x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following: cudaMallocPitch...
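A minimal sketch of the calls (hostPtr, Nrows and Ncols are assumed names): the pitch returned by cudaMallocPitch is the allocated width of each row in bytes, and must be reused when copying with cudaMemcpy2D.

```cuda
float *devPtr;
size_t devPitch;   // returned row width in bytes, >= Ncols * sizeof(float)
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

// Copy a contiguous host array (row width Ncols * sizeof(float)) into the
// pitched device allocation, row by row:
cudaMemcpy2D(devPtr, devPitch,
             hostPtr, Ncols * sizeof(float),
             Ncols * sizeof(float), Nrows,
             cudaMemcpyHostToDevice);
```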

A thing to care about when passing a struct to a CUDA kernel

Structures can be passed by value to CUDA kernels. However, some care should be devoted to setting up a proper destructor, since the destructor is called at exit from the kernel. Consider this example with the destructor uncommented, and do not pay too much attention to what the code actually does. If you run that code, you will receive the following output: Calling destructor Counting in the locked case: 512 Calling destructor GPUassert: invalid device pointer D:/Project/passStructToKer...
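A minimal sketch of the pitfall (hypothetical names, not the posted code): the struct is copied for the kernel launch, and if its destructor frees the device pointer, the same pointer ends up freed twice.

```cuda
struct DeviceBuffer {
    float *d_data;   // device pointer owned elsewhere
    int    N;
    // Dangerous if uncommented: the by-value copy passed to the kernel also
    // runs this destructor when the launch expression ends, freeing d_data
    // while the original object still holds (and will later free) the same
    // pointer, hence the "invalid device pointer" error seen above.
    // ~DeviceBuffer() { cudaFree(d_data); }
};

__global__ void scaleKernel(DeviceBuffer buf, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < buf.N) buf.d_data[i] *= s;
}
```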

Concurrency in CUDA multi-GPU executions

Achieving concurrent execution on multi-GPU systems is a very appealing feature, since it can scale down the execution time of embarrassingly parallel problems almost linearly with the number of GPUs. We have run some experiments on achieving concurrent execution on a cluster of 4 Kepler K20c GPUs. We have considered 8 test cases, whose corresponding codes, along with the profiler timelines, are reported below. Test case #1 - "Breadth-first" approach - synchronous copy Code - https://github.com/OrangeOwlSolutions/Multi...
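A sketch of a "breadth-first" pattern with asynchronous copies (stream, d_in, h_in, etc. are assumed names, not the posted code): every GPU is issued its copy-kernel-copy sequence before any device is waited on, so the per-device timelines can overlap.

```cuda
// Note: h_in/h_out should be pinned (cudaHostAlloc) for the async copies to
// actually overlap with work on other devices.
for (int g = 0; g < numGPUs; g++) {
    cudaSetDevice(g);
    cudaMemcpyAsync(d_in[g], h_in[g], bytes, cudaMemcpyHostToDevice, stream[g]);
    kernel<<<grid, block, 0, stream[g]>>>(d_in[g], d_out[g], N);
    cudaMemcpyAsync(h_out[g], d_out[g], bytes, cudaMemcpyDeviceToHost, stream[g]);
}
// Only after all devices have been issued work do we synchronize:
for (int g = 0; g < numGPUs; g++) {
    cudaSetDevice(g);
    cudaDeviceSynchronize();
}
```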

Optimizing the solution of the 2D diffusion (heat) equation in CUDA

On our GitHub website we are posting a fully worked code concerning the optimization of the solution approach for the 2D heat equation. Five approaches are considered, using: Global memory, essentially the OP's approach; Shared memory of size BLOCK_SIZE_X x BLOCK_SIZE_Y not loading the halo regions; Shared memory of size BLOCK_SIZE_X x BLOCK_SIZE_Y loading the halo regions; Shared memory of size (BLOCK_SIZE_X + 2) x (BLOCK_SIZE_Y + 2) loading the halo regions; Texture memory. E...
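As a sketch of the first (global memory) approach, with assumed names rather than the posted code, one explicit time step of the 2D heat equation is a five-point stencil:

```cuda
// One Jacobi step of the explicit 2D diffusion update, reading and writing
// global memory directly; alpha = diffusivity * dt / dx^2 (assumed uniform grid).
__global__ void heatStepGlobal(const float * __restrict__ T_old,
                               float * __restrict__ T_new,
                               int NX, int NY, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < NX - 1 && j > 0 && j < NY - 1) {
        int c = j * NX + i;   // center of the five-point stencil
        T_new[c] = T_old[c] + alpha * (T_old[c + 1] + T_old[c - 1] +
                                       T_old[c + NX] + T_old[c - NX] -
                                       4.0f * T_old[c]);
    }
}
```

The shared-memory variants differ only in staging the tile (with or without halo) before applying the same stencil.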

Dealing with boundary conditions in CUDA

The problem of dealing with boundary conditions in CUDA is encountered, for example, when computing the convolution between an image and a 3 x 3 kernel. When the convolution window comes across the boundary, one has the problem of extending the image outside of its boundaries. Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory, thanks to the different addressing modes it provides. On our GitHub website an example is provided on how using text...
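A sketch, using the texture-object API with assumed names (devPtr, width, height, pitch), of selecting an addressing mode so that out-of-bounds reads are extended by clamping to the nearest boundary pixel:

```cuda
cudaResourceDesc resDesc = {};
resDesc.resType                  = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr       = devPtr;   // pitched device image
resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
resDesc.res.pitch2D.width        = width;
resDesc.res.pitch2D.height       = height;
resDesc.res.pitch2D.pitchInBytes = pitch;

cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp;  // other modes: Wrap, Mirror, Border
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.filterMode     = cudaFilterModePoint;
texDesc.readMode       = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// In a kernel, tex2D<float>(tex, x, y) then returns the nearest boundary
// pixel whenever (x, y) falls outside the image.
```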

Tricks and Tips: Finding the minimum element position of a cudaMalloc'ed array by thrust::min_element

In this post, we consider the very common case in which one has a cudaMalloc'ed array and wants to find the position and the value of its minimum element by thrust::min_element. Below, we provide a full example. Basically, the solution exploits the idea of wrapping a thrust::device_ptr around the cudaMalloc'ed linear memory; the position is then found by thrust::distance. Here is the full code:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
/***...
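A self-contained sketch of the technique described above (illustrative data, not the posted code): wrap a thrust::device_ptr around the cudaMalloc'ed memory, find the minimum with thrust::min_element, and recover its position with thrust::distance.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
#include <cstdio>

int main()
{
    const int N = 5;
    float h[N] = {3.f, 1.f, 4.f, 0.5f, 2.f};
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    // Wrap the raw device pointer so Thrust algorithms can operate on it:
    thrust::device_ptr<float> dp(d);
    thrust::device_ptr<float> min_it = thrust::min_element(dp, dp + N);

    int   pos     = thrust::distance(dp, min_it);  // index of the minimum
    float min_val = *min_it;                       // dereference copies to host
    printf("min %f at position %d\n", min_val, pos);

    cudaFree(d);
    return 0;
}
```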

1D Finite Difference Time Domain (FDTD) in CUDA for the Helmholtz equation

There is a question on whether the 1D Finite Difference Time Domain (FDTD) method can be faster when implemented in C/C++ and run on a sequential machine than when implemented in CUDA and run on a parallel GPU. We try to answer this question with the code below, which contains implementations of the 1D FDTD method for an electromagnetic application in both C/C++ and CUDA. Theory and C/C++ implementations are taken from Understanding the Finite-Difference Time-Domain Method (see Program 3....
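A minimal sketch of the leapfrog update kernels (hypothetical names, following the usual 1D staggered grid of that reference, not the posted code):

```cuda
#define IMP0 377.0f   // free-space impedance

// Magnetic field update: Hy[m] is advanced from the neighboring Ez values.
__global__ void updateHy(float *Hy, const float *Ez, int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < M - 1) Hy[m] += (Ez[m + 1] - Ez[m]) / IMP0;
}

// Electric field update: Ez[m] is advanced from the neighboring Hy values.
__global__ void updateEz(float *Ez, const float *Hy, int M)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m > 0 && m < M) Ez[m] += (Hy[m] - Hy[m - 1]) * IMP0;
}
// One launch per field per time step; for small grids, this per-step launch
// overhead is what can make the sequential C/C++ version competitive.
```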

Tricks and Tips: cudaMalloc3D and cudaMemcpy3D

When accessing 3D arrays in CUDA, memory transactions are much faster if the data elements are properly aligned. CUDA provides the cudaMalloc3D function to "pad" the elements of 3D matrices with extra bytes so as to achieve the desired alignment. Refer to the "CUDA C Programming Guide", Sections 3.2.2 and 5.3.2, for more information. Such a function also returns the allocation pitches, which can subsequently be used by cudaMemcpy3D to properly access the data. Below, a full code is reported showing ho...
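A minimal sketch (Nx, Ny, Nz and hostPtr are assumed names): cudaMalloc3D returns a cudaPitchedPtr whose pitch is then consumed by cudaMemcpy3D.

```cuda
// Extent width is given in BYTES for linear (non-array) memory:
cudaExtent extent = make_cudaExtent(Nx * sizeof(float), Ny, Nz);

cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);   // devPitchedPtr.pitch = row pitch in bytes

// Describe the contiguous host source and the pitched device destination:
cudaMemcpy3DParms params = {0};
params.srcPtr = make_cudaPitchedPtr(hostPtr, Nx * sizeof(float), Nx, Ny);
params.dstPtr = devPitchedPtr;
params.extent = extent;
params.kind   = cudaMemcpyHostToDevice;
cudaMemcpy3D(&params);
```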