In the framework of accelerating computational codes by parallel computing on Graphics Processing Units (GPUs), the data to be processed must be transferred from the CPU to the GPU, and the results of the processing must be transferred back from the GPU to the CPU.
In a computational code accelerated by General Purpose GPU (GPGPU) computing, such transfers can occur many times and may affect the overall performance, so the question arises of how to carry them out as fast as possible.
To allow programmers to use a larger virtual address space than is actually available in the RAM, CPUs (or hosts, in the language of GPGPU) implement a virtual memory system (non-locked memory) in which a physical memory page can be swapped out to disk. When the host needs that page, it loads it back in from the disk.
The drawback of non-locked memory for CPU-GPU transfers is that memory transactions are slower, i.e., the bandwidth of the PCI-E bus connecting the CPU and the GPU is not fully exploited.
Non-locked memory is not guaranteed to reside in RAM (e.g., it can be in swap), so the driver needs to access every single page of the non-locked memory, copy it into a pinned buffer and pass it to the Direct Memory Access (DMA) engine (synchronous, page-by-page copy).
Indeed, PCI-E transfers occur only through DMA. Accordingly, when a “normal” transfer is issued, a block of page-locked memory must be allocated, followed by a host copy from regular memory to the page-locked block, the transfer itself, the wait for the transfer to complete and the deallocation of the page-locked block.
This consumes precious host time, which is avoided when page-locked memory is used directly.
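The staging steps listed above can be mimicked by hand, which makes the extra host work visible. The following is only a sketch under stated assumptions: `stagedCopyToDevice` is a hypothetical helper name, and the actual driver may behave differently (for instance, it may reuse internal staging buffers or copy in chunks).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): performs by hand the staging that a
// transfer from pageable memory implies.
cudaError_t stagedCopyToDevice(void *d_dst, const void *h_pageable, size_t bytes) {
    void *h_staging = nullptr;
    // 1. Allocate a block of page-locked memory.
    cudaError_t err = cudaMallocHost(&h_staging, bytes);
    if (err != cudaSuccess) return err;
    // 2. Host copy from regular (pageable) memory to the page-locked block.
    std::memcpy(h_staging, h_pageable, bytes);
    // 3. and 4. Transfer and wait: for this copy, cudaMemcpy returns only once it has completed.
    err = cudaMemcpy(d_dst, h_staging, bytes, cudaMemcpyHostToDevice);
    // 5. Release the page-locked block.
    cudaError_t freeErr = cudaFreeHost(h_staging);
    return (err != cudaSuccess) ? err : freeErr;
}

int main() {
    const size_t n = 1 << 20;                 // arbitrary example size
    const size_t bytes = n * sizeof(float);
    float *h_data = static_cast<float *>(std::malloc(bytes));  // pageable memory
    if (!h_data) return 1;
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        std::free(h_data);
        return 1;
    }
    cudaError_t err = stagedCopyToDevice(d_data, h_data, bytes);
    printf("staged copy: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    std::free(h_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Steps 1, 2 and 5 are pure host overhead: they disappear when the source buffer is page-locked from the start.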
However, with today’s memory sizes, the use of virtual memory is no longer necessary for the many applications that fit within the host memory space.
In all those cases, it is more convenient to use page-locked (pinned) memory, which enables the DMA engine on the GPU to request transfers to and from the host memory without the involvement of the CPU. In other words, locked memory is guaranteed to reside in physical memory (RAM), so the GPU (or device, in the language of GPGPU) can fetch it without the help of the host (asynchronous copy).
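One way to check whether page-locked memory pays off on a given machine is to time the same host-to-device copy from a pageable and from a pinned buffer. The sketch below assumes the CUDA runtime calls `cudaHostAlloc` (host page-locked allocation) and CUDA events for timing; `h2dBandwidthGBs` and the 64 MiB buffer size are illustrative choices, not taken from the text. The measured figures depend on the hardware, so none are quoted here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: times one host-to-device copy with CUDA events and
// returns the bandwidth in GB/s (a negative value signals an error).
double h2dBandwidthGBs(void *d_dst, const void *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0;
    }
    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0;
    return (static_cast<double>(bytes) / 1.0e9) / (static_cast<double>(ms) / 1.0e3);
}

int main() {
    const size_t bytes = (size_t)64 << 20;   // 64 MiB, an arbitrary test size
    void *h_pageable = std::malloc(bytes);   // ordinary, swappable host memory
    void *h_pinned = nullptr;                // page-locked host memory
    void *d_buf = nullptr;
    if (!h_pageable ||
        cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault) != cudaSuccess ||
        cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;  // process exit releases whatever was already allocated
    }
    // Touch both buffers so that their pages are actually backed by RAM.
    std::memset(h_pageable, 1, bytes);
    std::memset(h_pinned, 1, bytes);

    h2dBandwidthGBs(d_buf, h_pinned, bytes);  // warm-up copy, result discarded
    printf("pageable: %.2f GB/s\n", h2dBandwidthGBs(d_buf, h_pageable, bytes));
    printf("pinned:   %.2f GB/s\n", h2dBandwidthGBs(d_buf, h_pinned, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    return 0;
}
```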
GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk. To allocate page-locked memory on the host in CUDA, one can use cudaHostAlloc:
float *a;  // will point to n floats of page-locked host memory
cudaHostAlloc((void**)&a, n * sizeof(float), cudaHostAllocDefault);
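A slightly fuller version of this allocation, with error checking and the matching release call, could look as follows. It is a minimal sketch: `allocPinned` is a hypothetical wrapper and the element count is arbitrary. Page-locked host memory obtained this way should be released with `cudaFreeHost` rather than `free`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: returns n floats of page-locked host memory,
// or nullptr if the runtime could not pin that much memory.
float *allocPinned(size_t n) {
    float *p = nullptr;
    cudaError_t err = cudaHostAlloc((void **)&p, n * sizeof(float), cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return p;
}

int main() {
    const size_t n = 1 << 20;  // arbitrary example size
    float *a = allocPinned(n);
    if (!a) return 1;

    // The buffer is used like any other host array.
    for (size_t i = 0; i < n; ++i) a[i] = static_cast<float>(i);

    float *d_a = nullptr;
    if (cudaMalloc((void **)&d_a, n * sizeof(float)) == cudaSuccess) {
        // The source is already page-locked, so no intermediate staging copy is needed.
        cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaFree(d_a);
    }

    cudaFreeHost(a);  // release the pinned allocation
    return 0;
}
```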
1. Once pinned, memory is unavailable to other processes; in particular, pinning reduces the memory pool available to the OS.
Accordingly, which processes are concurrently running in the OS (e.g., whether they themselves are pinning memory) will determine how much memory is available to be pinned.
2. Pinned memory is useful for issuing host-device memory transfers at maximum rates (dictated by the speed of the PCI-E bus), and in the framework of CUDA streams, where asynchronous memory copy calls need the host memory to be page-locked.
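The second point can be illustrated with a minimal sketch in which a pinned host buffer is split between two CUDA streams, each issuing an asynchronous copy to the device, a kernel and an asynchronous copy back. The kernel `scale`, the helper `scaleWithStreams`, the number of streams and the buffer size are all illustrative assumptions, not part of the text above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: multiplies each element by s.
__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Hypothetical helper: scales a pinned host buffer in place, splitting the
// work between two streams so that copies and kernels may overlap.
bool scaleWithStreams(float *h_pinned, int n, float s) {
    if (n <= 0) return false;
    const int kStreams = 2;
    const int chunk = (n + kStreams - 1) / kStreams;
    float *d_x = nullptr;
    if (cudaMalloc((void **)&d_x, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t streams[kStreams];
    int created = 0;
    while (created < kStreams && cudaStreamCreate(&streams[created]) == cudaSuccess) ++created;
    bool ok = (created == kStreams);

    for (int k = 0; ok && k < kStreams; ++k) {
        const int offset = k * chunk;
        const int count = (offset + chunk <= n) ? chunk : n - offset;
        if (count <= 0) continue;
        const size_t bytes = count * sizeof(float);
        // The host buffer is page-locked, so these calls can return before the copy finishes.
        cudaMemcpyAsync(d_x + offset, h_pinned + offset, bytes, cudaMemcpyHostToDevice, streams[k]);
        scale<<<(count + 255) / 256, 256, 0, streams[k]>>>(d_x + offset, count, s);
        cudaMemcpyAsync(h_pinned + offset, d_x + offset, bytes, cudaMemcpyDeviceToHost, streams[k]);
    }
    // Wait for all streams, then pick up any launch error.
    if (cudaDeviceSynchronize() != cudaSuccess || cudaGetLastError() != cudaSuccess) ok = false;

    for (int k = 0; k < created; ++k) cudaStreamDestroy(streams[k]);
    cudaFree(d_x);
    return ok;
}

int main() {
    const int n = 1 << 20;  // arbitrary example size
    float *h = nullptr;
    if (cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    bool ok = scaleWithStreams(h, n, 2.0f);
    printf("ok = %d, h[0] = %f, h[n-1] = %f\n", ok ? 1 : 0, h[0], h[n - 1]);

    cudaFreeHost(h);
    return ok ? 0 : 1;
}
```

Within each stream the three operations run in issue order, while operations in different streams are free to overlap if the hardware allows it.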