CudaMallocPitch and cudaMemcpy2D – Update

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so to achieve the desired alignment. Please, refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of Nrow x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following:

cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

where

  • devPtr is an output pointer to float (float *devPtr) pointing to the allocated memory space;
  • devPitch is a size_t output variable denoting the length, in bytes, of the padded row;
  • Nrows and Ncols are size_t input variables representing the matrix size.

Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is returned in more efficient memory accesses.

CUDA provides also the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:

cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)

where

  • devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (source) device and (destination) host memory spaces, respectively;
  • devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
  • Nrows and Ncols are size_t input variables representing the matrix size.

Note that cudaMemcpy2D allows also for pitched memory allocation on the host side. If the host memory has no pitch, then hostPtr = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. For the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to

cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)

The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example

int    tidx = blockIdx.x*blockDim.x + threadIdx.x;
int    tidy = blockIdx.y*blockDim.y + threadIdx.y;
 
if ((tidx < Ncols) && (tidy < Nrows))
{
   float *row_a = (float *)((char*)devPtr + tidy * pitch);
   row_a[tidx] = row_a[tidx] * tidx * tidy;
}

In such an example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescence).
The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr by the row length tidy * pitch in bytes (char * is a pointer to bytes and sizeof(char) is 1 byte), where the length of each row is computed by using the pitch information.

On our GitHub site, we provide a fully worked example to show these concepts.

Finally, please, notice that CUDA makes the cudaMallocPitch available to allocate padded 2D arrays and cudaMalloc3D to allocate padded 3D arrays. However, a “cudaMalloc2D” function does not exist, its role being accomplished by cudaMallocPitch.

The final question regards whether the use of pitched memory allocation, i.e., memory allocation done with cudaMallocPitch, really leads to improved performance as compared to non-pitched memory allocation, i.e., done with cudaMalloc.
Actually, the improvements arising from the use of cudaMallocPitch depend on the compute capability and are expected to be more significant for older ones. However, for most recent compute capabilities, pitched memory allocation does not seem to lead to a relevant speedup.

The attached code  provides a performance testbench between the uses of non-pitched and pitched memories.
In particular, the code performs the summation between three (non-pitched or pitched) matrices. The reason for dealing with three matrices is the need to highlight memory transactions as compared to computation, so to highlight the differences between non-pitched and pitched allocations. Below are the timing results for a GTX 960 card and a GT 920M cards.

GTX 960
Non-pitched – Time = 3.242208; Memory = 65320000 bytes
Pitched        – Time = 3.150944;  Memory = 65433600 bytes

GT 920M
Non-pitched – Time = 20.496799; Memory = 65320000 bytes
Pitched        – Time = 20.418560; Memory = 65433600 bytes

As it can be seen, there is not much difference in the two implementations for the two cards. The above results also show the increase in memory occupancy due to the use of pitched memory allocation.

Leave a Reply

Your email address will not be published. Required fields are marked *