Limiting register usage in CUDA

Quoting the CUDA C Programming Guide, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance. __launch_bounds__ and the maxrregcount compiler option limit register usage through two different mechanisms. With __launch_bounds__, nvcc decides the number of registers to be used by a __global__ function by balancing the performance and the generality of the kernel launch setup. Saying it diffe...
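As a minimal sketch (the kernel, its body and the launch bounds values are illustrative, not from the post): __launch_bounds__ is attached to the kernel definition, while maxrregcount is passed to nvcc for the whole compilation unit.

    // Hint to nvcc: at most 256 threads per block, and at least 2 blocks
    // resident per multiprocessor; nvcc caps register usage accordingly.
    __global__ void __launch_bounds__(256, 2) myKernel(float *d_data, int N)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < N) d_data[tid] = 2.0f * d_data[tid];
    }

    // Alternatively, limit registers for every kernel in the file at compile time:
    //   nvcc -maxrregcount=32 mykernels.cu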
More

SVD of a real matrix in CUDA

The calculation of the Singular Value Decomposition (SVD) of a matrix lies at the basis of many computations and approaches in applied science. One example is the regularized solution of linear systems of equations; another is Principal Component Analysis. Often, the applications requiring the SVD deal with large matrices and/or need the SVD computed within an iterative process. Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVE...
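A minimal sketch of the cuSOLVER path for a single-precision matrix already resident on the device (variable names and the omission of error checking are illustrative; the full treatment is in the post):

    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    // Compute the SVD A = U * S * V^T of an m x n matrix (column-major, m >= n).
    void svdOnDevice(float *d_A, int m, int n, float *d_S, float *d_U, float *d_VT)
    {
        cusolverDnHandle_t handle;
        cusolverDnCreate(&handle);

        int lwork = 0;
        cusolverDnSgesvd_bufferSize(handle, m, n, &lwork);

        float *d_work;  cudaMalloc(&d_work, lwork * sizeof(float));
        int   *d_info;  cudaMalloc(&d_info, sizeof(int));

        // 'A' requests all columns of U and all rows of V^T.
        cusolverDnSgesvd(handle, 'A', 'A', m, n, d_A, m, d_S,
                         d_U, m, d_VT, n, d_work, lwork, nullptr, d_info);

        cudaFree(d_work); cudaFree(d_info);
        cusolverDnDestroy(handle);
    }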
More

CUDA mex function using real data residing on the host and producing real results on the host

In the CUDA_mex_host_to_device GitHub directory, we provide an example of how to create a mex function executing on the GPU when the input real data reside on the host and the final results are returned to the host. The first thing to do is to recover the pointer to the first element of the real data from the Matlab input array/matrix: double *h_input = mxGetPr(prhs[0]); We can also recover the number of elements of the input variable (the input variable can also be a matrix) as: in...
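A minimal sketch of the overall flow (the kernel and its doubling operation are illustrative placeholders; the actual code is in the repository):

    #include "mex.h"
    #include <cuda_runtime.h>

    __global__ void scaleKernel(double *d_data, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) d_data[i] *= 2.0;                     // placeholder operation
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        double *h_input = mxGetPr(prhs[0]);              // host-side input data
        int N = (int)mxGetNumberOfElements(prhs[0]);     // works for vectors and matrices

        double *d_data;
        cudaMalloc(&d_data, N * sizeof(double));
        cudaMemcpy(d_data, h_input, N * sizeof(double), cudaMemcpyHostToDevice);

        scaleKernel<<<(N + 255) / 256, 256>>>(d_data, N);

        // Allocate the Matlab output and copy the results back to the host.
        plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);
        cudaMemcpy(mxGetPr(plhs[0]), d_data, N * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }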
More

cudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information. Assuming that we want to allocate a 2D padded array of Nrows x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following: cudaMallocPitch...
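A minimal sketch of the allocation and the host-to-device copy (function and variable names are illustrative):

    #include <cuda_runtime.h>

    // Allocate a pitched Nrows x Ncols float array on the device and copy a
    // packed host array into it; 'pitch' is the padded row width in bytes.
    void copy2DToDevice(const float *h_A, int Nrows, int Ncols,
                        float **d_A, size_t *pitch)
    {
        cudaMallocPitch((void**)d_A, pitch, Ncols * sizeof(float), Nrows);
        cudaMemcpy2D(*d_A, *pitch,
                     h_A, Ncols * sizeof(float),     // source pitch = packed row width
                     Ncols * sizeof(float), Nrows,   // width in bytes, height in rows
                     cudaMemcpyHostToDevice);
    }

    // Inside a kernel, row r is then addressed through the pitch:
    //   float *row = (float*)((char*)d_A + r * pitch);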
More

Implementing a critical section in CUDA

Critical sections are sequences of operations that must be executed sequentially by the CUDA threads. Suppose we want to construct a kernel whose task is to compute the number of thread blocks in a thread grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, all the increments must occur sequentially, so they must be incorporated in a critical section. This is illustrated in the code on our GitHub web page. S...
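A minimal sketch of the lock-based pattern (variable names are illustrative; the full, tested code is on GitHub):

    #include <cuda_runtime.h>

    __device__ int d_lock = 0;        // 0 = free, 1 = taken
    __device__ int d_numBlocks = 0;   // global counter protected by the lock

    __global__ void countBlocks()
    {
        if (threadIdx.x == 0) {
            // Spin until the lock is acquired, then execute the critical section.
            bool done = false;
            while (!done) {
                if (atomicCAS(&d_lock, 0, 1) == 0) {
                    d_numBlocks = d_numBlocks + 1;   // critical section
                    __threadfence();                 // make the update visible
                    atomicExch(&d_lock, 0);          // release the lock
                    done = true;
                }
            }
        }
    }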
More

A thing to care about when passing a struct to a CUDA kernel

Structures can be passed by value to CUDA kernels. However, some care should be devoted to setting up a proper destructor, since the destructor is called at exit from the kernel. Consider this example with the uncommented destructor, and do not pay too much attention to what the code actually does. If you run that code, you will receive the following output: Calling destructor Counting in the locked case: 512 Calling destructor GPUassert: invalid device pointer D:/Project/passStructToKer...
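A minimal sketch of the pitfall (names and the struct contents are illustrative, not the post's code): the copy of the struct made for the by-value kernel launch is also destroyed, so a destructor that frees the device pointer ends up freeing it twice.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative struct: it owns a device pointer and frees it in its destructor.
    struct DeviceBuffer {
        float *d_data;
        DeviceBuffer(int N)  { cudaMalloc(&d_data, N * sizeof(float)); }
        ~DeviceBuffer()      { printf("Calling destructor\n"); cudaFree(d_data); }
    };

    __global__ void useBuffer(DeviceBuffer buf) {   // passed by value: a copy is made
        buf.d_data[threadIdx.x] = (float)threadIdx.x;
    }

    int main() {
        DeviceBuffer buf(32);
        useBuffer<<<1, 32>>>(buf);
        cudaDeviceSynchronize();
        // The destructor of the by-value copy frees d_data; the destructor of
        // 'buf' then calls cudaFree on the same pointer again, producing the
        // "invalid device pointer" error shown in the output above. A common
        // fix is to avoid freeing device memory in the destructor of a struct
        // that is passed by value to a kernel.
        return 0;
    }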
More

Count the occurrences of numbers in a CUDA array

We compare two approaches to counting the occurrences of numbers in a CUDA array. Both approaches use CUDA Thrust: using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example; using thrust::unique_copy and thrust::upper_bound. A fully worked example is available on our GitHub page. The first approach has proven to be the faster one. On an NVIDIA GTX 960 card, we have had the following timings for N = 1048576 array elements: First ap...
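A minimal sketch of the first approach (a dense histogram in the spirit of the Thrust histogram example; the function name and bin layout are illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/binary_search.h>
    #include <thrust/adjacent_difference.h>
    #include <thrust/iterator/counting_iterator.h>

    // Count occurrences of the integers 0 .. numBins-1 in 'data'.
    thrust::device_vector<int> denseHistogram(thrust::device_vector<int> data,
                                              int numBins)
    {
        thrust::sort(data.begin(), data.end());

        thrust::device_vector<int> histogram(numBins);

        // For every bin b, find how many sorted elements are <= b ...
        thrust::upper_bound(data.begin(), data.end(),
                            thrust::counting_iterator<int>(0),
                            thrust::counting_iterator<int>(numBins),
                            histogram.begin());

        // ... and take adjacent differences to obtain per-bin counts.
        thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                    histogram.begin());
        return histogram;
    }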
More

Compiling CUDA mex files with Visual Studio 2013

Configuration: Matlab 2015b, Visual Studio 2013, Intel 64-bit machine. In Visual Studio do the following:
1) File -> New Project; select location and name; in the project type, select NVIDIA -> CUDA 8.0 (choose your CUDA version as appropriate);
2) Project -> Properties -> Configuration Manager -> Active Solution Platform -> choose x64;
3) Project -> Properties -> Configuration -> Release (possibly optional);
4) Project -> Properties -> Configuration ...
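As a rough indication of what the remaining project settings typically amount to for a Windows 64-bit mex build (the Matlab installation path below is only an example and is not taken from the post):

    Additional Include Directories:  C:\Program Files\MATLAB\R2015b\extern\include
    Additional Library Directories: C:\Program Files\MATLAB\R2015b\extern\lib\win64\microsoft
    Additional Dependencies:        libmx.lib; libmex.lib; libmat.lib
    Output file extension:          .mexw64
    Exported entry point:           mexFunction (linker option /EXPORT:mexFunction)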
More

Sorting 2 or 3 arrays by key with CUDA Thrust

We have compared two approaches to sorting two or three arrays by a common key. One approach uses thrust::zip_iterator and the other thrust::gather. We have tested them when sorting two arrays and when sorting three arrays. In both cases, the approach using thrust::gather has proven to be faster. The full codes are available on our GitHub website: 2 Arrays solution 3 Arrays solution In the following, some timing results (NVIDIA GTX 960 card): Timing in the case of 2 arrays for...
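A minimal sketch of the two approaches for the two-array case (function and variable names are illustrative; the full codes are linked above):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/sequence.h>
    #include <thrust/gather.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/tuple.h>

    void sortTwoArraysByKey(thrust::device_vector<int>   &keys,
                            thrust::device_vector<float> &a,
                            thrust::device_vector<float> &b)
    {
        // Approach 1: sort the keys together with a zipped view of the values.
        // thrust::sort_by_key(keys.begin(), keys.end(),
        //     thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin())));

        // Approach 2 (the faster one in the tests above): sort a permutation
        // index by key, then gather each value array through it.
        int N = keys.size();
        thrust::device_vector<int> indices(N);
        thrust::sequence(indices.begin(), indices.end());
        thrust::sort_by_key(keys.begin(), keys.end(), indices.begin());

        thrust::device_vector<float> tmp(N);
        thrust::gather(indices.begin(), indices.end(), a.begin(), tmp.begin());
        a = tmp;
        thrust::gather(indices.begin(), indices.end(), b.begin(), tmp.begin());
        b = tmp;
    }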
More