Matrix transposition in CUDA

Matrix transposition is a very common operation in linear algebra. From a numerical point of view, it is a memory-bound problem: there is practically no arithmetic in it, and the operation essentially consists of rearranging the layout of the matrix in memory. Due to the particular architecture of a GPU and to the cost of global memory operations, matrix transposition admits no naive implementation if performance is of interest. We here compare two different possibilities of performing this operation on the GPU.
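As a quick illustration of the point, here is a minimal sketch contrasting a naive transpose kernel with a tiled, shared-memory one; the kernel names, the TILE_DIM value and the assumption of row-major storage are ours and do not necessarily match the variants benchmarked in the full post. Both kernels are meant to be launched with TILE_DIM x TILE_DIM thread blocks covering the input matrix.

#include <cuda_runtime.h>

#define TILE_DIM 32

// Naive transpose: the reads from 'in' are coalesced, but the writes to
// 'out' are strided by 'height' elements, which wastes memory bandwidth.
__global__ void transposeNaive(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}

// Tiled transpose: a shared-memory tile is filled with coalesced reads and
// drained with coalesced writes; the extra column avoids bank conflicts.
__global__ void transposeShared(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap the block indices so that consecutive threads write consecutive
    // addresses of the transposed matrix.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}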

Tricks and Tips: cudaMallocPitch and cudaMemcpy2D

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned. CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information. Assuming that we want to allocate a 2D padded array of Nrows x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following: cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows); where devPtr is the returned device pointer and devPitch is a size_t output parameter receiving the length, in bytes, of each padded row.
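Putting cudaMallocPitch and cudaMemcpy2D together, a minimal round-trip sketch might look as follows; the Nrows/Ncols dimensions and the doubleElements kernel are illustrative choices of ours. Note how the kernel steps between rows using the pitch in bytes, not Ncols.

#include <cstdio>
#include <cuda_runtime.h>

// Doubles every element of a pitched 2D array. Rows are 'pitch' bytes apart,
// so the row start is computed with a char* cast before indexing columns.
__global__ void doubleElements(float *devPtr, size_t pitch, int Nrows, int Ncols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < Nrows && col < Ncols) {
        float *rowStart = (float *)((char *)devPtr + row * pitch);
        rowStart[col] *= 2.0f;
    }
}

int main()
{
    const int Nrows = 64, Ncols = 100;
    static float hostArray[Nrows][Ncols];
    for (int r = 0; r < Nrows; ++r)
        for (int c = 0; c < Ncols; ++c)
            hostArray[r][c] = 1.0f;

    // Allocate the padded device array; devPitch receives the padded row
    // length in bytes (>= Ncols * sizeof(float)).
    float *devPtr;
    size_t devPitch;
    cudaMallocPitch((void **)&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

    // cudaMemcpy2D copies row by row, honoring the different pitches of the
    // (unpadded) host array and the (padded) device array.
    cudaMemcpy2D(devPtr, devPitch, hostArray, Ncols * sizeof(float),
                 Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((Ncols + block.x - 1) / block.x, (Nrows + block.y - 1) / block.y);
    doubleElements<<<grid, block>>>(devPtr, devPitch, Nrows, Ncols);

    cudaMemcpy2D(hostArray, Ncols * sizeof(float), devPtr, devPitch,
                 Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);

    printf("hostArray[0][0] = %f\n", hostArray[0][0]); // expect 2.0
    cudaFree(devPtr);
    return 0;
}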

CUDA Pinned memory

In the framework of accelerating computational codes by parallel computing on Graphics Processing Units (GPUs), the data to be processed must be transferred from the CPU to the GPU, and the results of the processing from the GPU back to the CPU. In a computational code accelerated by General Purpose GPU (GPGPU) computing, such transactions can occur many times and may affect the overall performance, so that the problem of carrying out those transfers in the fastest way arises. To allow programmers to speed up these transfers, CUDA offers page-locked (pinned) host memory, allocated with cudaMallocHost or cudaHostAlloc: since a pinned buffer cannot be paged out by the operating system, copies to and from the device can proceed at full bus bandwidth and can be issued asynchronously with cudaMemcpyAsync, overlapping with kernel execution.
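A minimal sketch of the idea follows; the buffer size and the use of a single stream are illustrative choices of ours.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Page-locked (pinned) host allocation: the buffer cannot be swapped
    // out, so host <-> device copies run faster than from pageable memory.
    float *hostPinned;
    cudaMallocHost((void **)&hostPinned, bytes);
    for (int i = 0; i < N; ++i) hostPinned[i] = 1.0f;

    float *devPtr;
    cudaMalloc((void **)&devPtr, bytes);

    // With pinned memory, cudaMemcpyAsync returns control to the host
    // immediately; work queued on the same stream still executes in order.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(devPtr, hostPinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(hostPinned, devPtr, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("hostPinned[0] = %f\n", hostPinned[0]);

    cudaStreamDestroy(stream);
    cudaFree(devPtr);
    cudaFreeHost(hostPinned);
    return 0;
}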