Tricks and Tips: cudaMallocPitch and cudaMemcpy2D

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.

CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.

Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch is the following:

cudaMallocPitch((void**)&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

where

  • devPtr is an output pointer to float (float *devPtr);
  • devPitch is a size_t output variable denoting the length, in bytes, of the padded row;
  • Nrows and Ncols are size_t input variables representing the matrix size.

Recalling that C/C++ and CUDA store 2D matrices in row-major order, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data.
Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this cost is repaid by more efficient memory accesses.
CUDA also provides the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch.
Under the above hypotheses (single precision 2D matrix), the syntax is the following:

cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);

where

  • devPtr and hostPtr are pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
  • devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
  • Nrows and Ncols are size_t input variables representing the matrix size.

Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional; in the above example, we are copying data from host to device.
If we want to copy data from device to host, then the above line changes to

cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost);

The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example

int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;

if ((tidx < Ncols) && (tidy < Nrows))
{
   float *row_a = (float *)((char*)devPtr + tidy * pitch);
   row_a[tidx] = row_a[tidx] * tidx * tidy;
}

In such an example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor memory coalescing).
The pointer to the first element of a row is computed by offsetting the initial pointer devPtr by tidy * pitch bytes: the cast to char * makes the arithmetic byte-wise (sizeof(char) is 1 byte), and the length of each row is given by the pitch.

Below, we provide a fully worked example to show these concepts.

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>   // Windows-only, provides getch()

#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16

#define Nrows 3
#define Ncols 5

/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getch(); exit(code); }
    }
}

/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
   int tidx = blockIdx.x*blockDim.x + threadIdx.x;
   int tidy = blockIdx.y*blockDim.y + threadIdx.y;

   if ((tidx < Ncols) && (tidy < Nrows))
   {
       float *row_a = (float *)((char*)devPtr + tidy * pitch);
       row_a[tidx] = row_a[tidx] * tidx * tidy;
    }
}

/********/
/* MAIN */
/********/
int main()
{
   float hostPtr[Nrows][Ncols];
   float *devPtr;
   size_t pitch;

   for (int i = 0; i < Nrows; i++)
      for (int j = 0; j < Ncols; j++) {
         hostPtr[i][j] = 1.f;
         //printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
      }

   // --- 2D pitched allocation and host->device memcopy
   gpuErrchk(cudaMallocPitch((void**)&devPtr, &pitch, Ncols * sizeof(float), Nrows));
   gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols * sizeof(float), Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice));

   dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
   dim3 blockSize(BLOCKSIZE_x, BLOCKSIZE_y);

   test_kernel_2D<<<gridSize, blockSize>>>(devPtr, pitch);
   gpuErrchk(cudaPeekAtLastError());
   gpuErrchk(cudaDeviceSynchronize());

   gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));

   for (int i = 0; i < Nrows; i++)
      for (int j = 0; j < Ncols; j++)
         printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);

   return 0;
}

8 thoughts on “Tricks and Tips: cudaMallocPitch and cudaMemcpy2D”

  1. Maksudul Alam says:

    Great Article, helped me understanding using the pitch for rectangular grid. Thanks!

  2. Nayan Yengul says:

    Hello,
    This is a very useful article. But I have one doubt: is it necessary to allocate the 2D array on the host statically? Can we allocate our host 2D array dynamically and then copy to device?
    Please suggest me a solution.
    Please suggest me solution.

    1. OrangeOwl says:

      Yes, the code can be employed either when the host array is statically or dynamically allocated.

    2. Liangchao Shang says:

      Sure, you can allocate a host 2D array dynamically and copy to/from the device.

  3. Corey D says:

    One question:

    in the kernel function, should
    (tidx<M) && (tidy<N)
    be
    (tidx<N) && (tidy<M) ?

    1. OrangeOwl says:

      Thank you for your inquiry. Actually, the lines you mentioned were correct, but there was a mistake in addressing the row elements.
      We have fully re-edited the post, fixing the code and better explaining the cudaMallocPitch and cudaMemcpy2D functions.

  4. phillip Smith says:

    I am trying this with a dynamically allocated 2D array of ints like this
    int **popArray;
    popArray = (int **)calloc(PopSize,sizeof(int*));
    int *trialsArray = (int *)calloc(PopSize,sizeof(int));
    for (int indv = 0; indv < PopSize; indv++)
    {
    printf("starting Indv = %d \n",indv);
    popArray[indv] = (int *)calloc(GenomeLength , sizeof(int));
    }
    I have an cudarray
    int *gpuStartPop;
    cudaMallocPitch((void**)&gpuStartPop, &pitch, lenG * sizeof(int), PopSize);

    kernel stuff here

    I copy it back to the host
    cudaMemcpy2D(startPop,pitch,gpuStartPop,pitch,lenG * sizeof(int),PopSize,cudaMemcpyDeviceToHost)

    I can then go through the startPop getting the genomes back
    int i = 0;
    for (i = 0; i < PopSize; i++)
    {
    //printf("pointer = %d of array size %d \n",(int) (i * pitch), (int)sizeof(startPop));
    int* newGenome = (int*)((char*)startPop + i * pitch);
    //printf("sizeof New Genome = %d \n",(int)sizeof(newGenome));
    printGenome(newGenome,lenG);

    }

    However I always get "Segmentation fault (core dumped)" when my program finishes. If I comment out the cudaMemcpy2D back from the gpu to host the problem goes. any ideas

    1. phillip Smith says:

      sorry this may be confusing I set up PopSize in in the file main.cpp and pass it to my file by reference to my kernel.cu page
      lenG = genomelength
