Limiting register usage in CUDA

The premise of this post is that, quoting the CUDA C Programming Guide, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance. Now, __launch_bounds__ and maxregcount limit register usage by two different mechanisms. __launch_bounds__: nvcc decides the number of registers to be used by a __global__ function by balancing performance against the generality of the kernel launch setup. Saying it diffe...

SVD of a real matrix in CUDA

The calculation of the Singular Value Decomposition (SVD) of a matrix underlies many computations and approaches in applied science. One example is the regularized solution of linear systems of equations. Another is Principal Component Analysis. Often, the applications requiring the SVD deal with large matrices and/or compute the SVD repeatedly within an iterative process. Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVE...

CUDA mex function using real data residing on the host and producing real results on the host

In the CUDA_mex_host_to_device GitHub directory, we provide an example of how to create a mex function executing on the GPU when the real input data reside on the host and the final results are returned to the host. The first thing to do is to recover the pointer to the first element of the real data from the Matlab input array/matrix: double *h_input = mxGetPr(prhs[0]); We can also recover the number of elements of the input variable (the input variable can also be a matrix) as: in...

cudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information. Assuming that we want to allocate a 2D padded array of Nrow x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following: cudaMallocPitch...
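The idea of row padding can be sketched on the host in plain C++, without calling the CUDA runtime. The helper names paddedPitchBytes and elementAt are hypothetical, and the alignment value is an assumption: the real pitch is whatever cudaMallocPitch returns. The addressing follows the usual pitched-array rule, in which the row offset is counted in bytes and the column offset in elements.

```cpp
#include <cstddef>

// Hypothetical helper: round a row width (in bytes) up to the next multiple
// of `alignment`. The alignment is an assumption made for illustration only;
// on a real device the pitch is chosen by cudaMallocPitch.
std::size_t paddedPitchBytes(std::size_t widthBytes, std::size_t alignment) {
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: address of element (row, col) in a pitched array of
// floats. The row offset is in bytes, the column offset is in elements.
float* elementAt(void* base, std::size_t pitchBytes,
                 std::size_t row, std::size_t col) {
    char* rowStart = static_cast<char*>(base) + row * pitchBytes;
    return reinterpret_cast<float*>(rowStart) + col;
}
```

For instance, a row of 100 floats occupies 400 bytes; with an assumed 512-byte alignment, each row would be padded to 512 bytes, and row r would start r * 512 bytes after the base address.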

Implementing a critical section in CUDA

Critical sections are sequences of operations that must be executed sequentially by the CUDA threads. Suppose we want to construct a kernel whose task is to compute the number of thread blocks in a thread grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, all the increases must occur sequentially, so they must be enclosed in a critical section. This is illustrated in the code on our GitHub web page. S...

Graphical connections to Ubuntu Linux from Windows

Suppose that you have a Windows system and want to connect to a remote Ubuntu Linux machine, and that you also want to run some applications on that machine with their graphical interface available. First step: configure the Windows system. Download PuTTY. Install Xming. Use a simple Google search for “sourceforge xming x server windows”. When asked which fonts to install in the “Custom installation” stage, we recommend installing all the fonts. Seco...

Tricks and Tips – Using omp_set_num_threads and omp_get_num_threads

When programming with OpenMP, note that omp_get_num_threads() returns 1 in sequential sections of the code. Accordingly, even if omp_set_num_threads() has been used to request more than one thread, any call to omp_get_num_threads() returns 1 unless it is made inside a parallel section. The example on our GitHub website tries to clarify this point.
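A minimal sketch of this behavior follows; it is not the example from the GitHub website. The struct ThreadCounts and the function queryThreadCounts are hypothetical names. The fallback stubs are an assumption added so that the file still builds when OpenMP is not enabled, in which case the pragmas are ignored and both counts are 1.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Assumption of this sketch: minimal stubs so the code builds without OpenMP.
static void omp_set_num_threads(int) {}
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical container for the two values we want to compare.
struct ThreadCounts {
    int serial;    // omp_get_num_threads() outside any parallel region
    int parallel;  // omp_get_num_threads() inside a parallel region
};

ThreadCounts queryThreadCounts(int requested) {
    omp_set_num_threads(requested);

    ThreadCounts counts;
    // Called from sequential code this is 1, whatever was requested above.
    counts.serial = omp_get_num_threads();
    counts.parallel = 0;

    #pragma omp parallel
    {
        // A single thread records the team size to avoid a data race.
        #pragma omp single
        counts.parallel = omp_get_num_threads();
    }
    return counts;
}
```

When built with OpenMP enabled (for example, -fopenmp with GCC) and called from sequential code, queryThreadCounts(4) should report serial equal to 1, while parallel reflects the team size actually granted by the runtime, which may be smaller than the requested value.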

Compiling mex files with Visual Studio 2013

Configuration: Matlab 2015b, Visual Studio 2013, Intel 64bit machine. In Visual Studio do the following: 1) File -> New Project; Select location and name; in the project type, select Templates -> Visual C++ -> Win32 -> Win32 Console Application -> OK; 2) In the Win32 Application Wizard, click Next, in the Application Type choose DLL, then click Finish. 3) Project -> Properties -> Configuration Manager -> Active Solution Platform -> New -> Type or Select ...
More

A thing to care about when passing a struct to a CUDA kernel

Structures can be passed by value to CUDA kernels. However, care should be taken to set up a proper destructor, since the destructor is called at exit from the kernel. Consider this example with the uncommented destructor, and do not pay too much attention to what the code actually does. If you run that code, you will receive the following output: Calling destructor Counting in the locked case: 512 Calling destructor GPUassert: invalid device pointer D:/Project/passStructToKer...
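The underlying mechanism can be illustrated with a host-only C++ analogue, which is not the code from the post. All names here (Tracked, byValue, destructorCallsAfterByValueCall) are hypothetical. Passing an object by value creates a copy, and the destructor of that copy runs once the call completes while the original object is still alive. If the destructor released a resource shared by both copies, such as a device pointer, the original would later try to release it a second time.

```cpp
// Hypothetical struct whose destructor only counts how often it runs.
struct Tracked {
    static int destructorCalls;
    int* data;  // stands in for a resource shared by all copies
    ~Tracked() { ++destructorCalls; }
};
int Tracked::destructorCalls = 0;

// Taking the parameter by value forces a copy of the caller's object.
void byValue(Tracked t) { (void)t; }

// Returns how many destructors have run right after the by-value call,
// while the original object is still in scope.
int destructorCallsAfterByValueCall() {
    Tracked::destructorCalls = 0;
    Tracked original;
    original.data = nullptr;
    byValue(original);                // the copy is destroyed after the call
    return Tracked::destructorCalls;  // the original has not been destroyed yet
}
```

In this sketch the destructor has already run once when the by-value call returns, and it runs again when original goes out of scope, which mirrors the two "Calling destructor" lines in the output quoted above.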