# Limiting register usage in CUDA

The premise of this post is that, quoting the CUDA C Programming Guide, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance. `__launch_bounds__` and `maxrregcount` limit register usage by two different mechanisms. With `__launch_bounds__`, nvcc decides the number of registers to be used by a `__global__` function by balancing performance against the generality of the kernel launch setup. Saying it diffe...
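As a minimal sketch of the `__launch_bounds__` mechanism mentioned above: the kernel `scaleKernel`, the helper `registersPerThread`, and the two bound values are hypothetical names and numbers chosen for illustration, not taken from the original post. The qualifier tells nvcc the largest block size the kernel will be launched with and, optionally, a desired minimum number of resident blocks per multiprocessor; the helper then queries how many registers per thread the compiler actually assigned.

```cuda
#include <cuda_runtime.h>

// Hypothetical bounds, chosen only for illustration.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM     2

// The qualifier promises nvcc that this kernel is never launched with more
// than MAX_THREADS_PER_BLOCK threads per block and asks it to aim for at
// least MIN_BLOCKS_PER_SM resident blocks per multiprocessor.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scaleKernel(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

// Returns the number of registers per thread that nvcc assigned to
// scaleKernel, or -1 if the query fails.
int registersPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scaleKernel) != cudaSuccess) return -1;
    return attr.numRegs;
}
```

The other mechanism is a compiler flag rather than a source annotation, for example `nvcc --maxrregcount=32 file.cu`, which caps the register count for every kernel in the compilation unit; the value 32 here is likewise only an example.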
# SVD of a real matrix in CUDA

Computing the Singular Value Decomposition (SVD) of a matrix underlies many computations and approaches in applied science. One example is the regularized solution of linear systems of equations; another is Principal Component Analysis. Applications that need the SVD often deal with large matrices and/or recompute it repeatedly within an iterative process. Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVE...
# CUDA mex function using real data residing on the host and producing real results on the host

In the CUDA_mex_host_to_device GitHub directory, we provide an example of how to create a mex function that executes on the GPU when the real input data reside on the host and the final results are returned to the host. The first thing to do is to recover the pointer to the first element of the real data from the Matlab input array/matrix: `double *h_input = mxGetPr(prhs[0]);` We can also recover the number of elements of the input variable (which can also be a matrix) as: in...
# cudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information. Assuming that we want to allocate a 2D padded array of Nrow x Ncols single-precision floating point elements, the syntax for cudaMallocPitch is the following: cudaMallocPitch...
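The padded allocation described above can be sketched as follows. The names `addOne`, `roundTrip`, `Nrows`, and the 16 x 16 block size are assumptions made for this illustration and do not come from the original post. The sketch allocates pitched device memory, copies a densely packed host matrix into it with cudaMemcpy2D, addresses each row through the pitch (which is expressed in bytes) inside a kernel, and copies the result back.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element. A row starts 'pitch' bytes after the
// previous one, so the row pointer is computed with byte arithmetic.
__global__ void addOne(float *devPtr, size_t pitch, int Nrows, int Ncols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < Nrows && col < Ncols) {
        float *rowPtr = (float *)((char *)devPtr + row * pitch);
        rowPtr[col] += 1.0f;
    }
}

// Copies a dense row-major host matrix to pitched device memory, adds 1 to
// every element on the GPU, and copies it back. Returns true on success.
bool roundTrip(std::vector<float> &h, int Nrows, int Ncols)
{
    float *d = nullptr;
    size_t pitch = 0;                           // filled in by cudaMallocPitch, in bytes
    size_t rowBytes = Ncols * sizeof(float);    // unpadded row width, in bytes
    if (cudaMallocPitch((void **)&d, &pitch, rowBytes, Nrows) != cudaSuccess)
        return false;

    // Host rows are contiguous, so the host pitch equals rowBytes.
    cudaError_t err = cudaMemcpy2D(d, pitch, h.data(), rowBytes,
                                   rowBytes, Nrows, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((Ncols + block.x - 1) / block.x, (Nrows + block.y - 1) / block.y);
        addOne<<<grid, block>>>(d, pitch, Nrows, Ncols);
        err = cudaMemcpy2D(h.data(), rowBytes, d, pitch,
                           rowBytes, Nrows, cudaMemcpyDeviceToHost);
    }
    cudaFree(d);
    return err == cudaSuccess;
}
```

Note that the width arguments of both cudaMallocPitch and cudaMemcpy2D are given in bytes, while the height is given in rows.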
# Implementing a critical section in CUDA

Critical sections are sequences of operations that must be executed sequentially, that is, by one CUDA thread at a time. Suppose we want to write a kernel whose task is to count the thread blocks in a grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increment a global counter. To prevent race conditions, all the increments must occur sequentially, so they must be enclosed in a critical section. This is illustrated in the code on our GitHub web page. S...
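One common way to build such a critical section, shown here as a minimal sketch and not necessarily the approach used in the code on the GitHub page, is a spin lock based on atomicCAS. The names `lockVar`, `countBlocks`, and `countThreadBlocks` are hypothetical. Only one thread per block competes for the lock, which avoids the deadlock that can arise when several threads of the same warp spin on it.

```cuda
#include <cuda_runtime.h>

// Hypothetical global lock: 0 = free, 1 = taken.
__device__ int lockVar = 0;

__global__ void countBlocks(volatile int *counter)
{
    if (threadIdx.x == 0) {
        // Spin until this thread swaps the lock from 0 to 1.
        while (atomicCAS(&lockVar, 0, 1) != 0) { }
        __threadfence();

        // Critical section: a plain read-modify-write, safe only because
        // a single thread at a time can be here.
        *counter = *counter + 1;

        __threadfence();            // make the update visible before unlocking
        atomicExch(&lockVar, 0);    // release the lock
    }
}

// Launches the kernel and returns the value of the counter, or -1 on error.
int countThreadBlocks(int numBlocks, int threadsPerBlock)
{
    int *dCounter = nullptr;
    if (cudaMalloc((void **)&dCounter, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dCounter, 0, sizeof(int));

    countBlocks<<<numBlocks, threadsPerBlock>>>(dCounter);

    int hCounter = -1;
    if (cudaMemcpy(&hCounter, dCounter, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        hCounter = -1;
    cudaFree(dCounter);
    return hCounter;
}
```

For a plain counter like this one, a single atomicAdd would be enough; the lock is worth its cost only when the protected region contains several operations that must not be interleaved.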
# Graphical connections to Ubuntu Linux from Windows

Suppose that you have a Windows system and want to connect to a remote Ubuntu Linux machine, and that you also want to run some applications on that machine with their graphical interfaces available. First step: configure the Windows system. Download PuTTY. Install Xming; a simple Google search for “sourceforge xming x server windows” will find it. When asked which fonts to install in the “Custom installation” stage, we recommend installing all of them. Seco...
