Limiting register usage in CUDA

The preface of this post is that, quoting the CUDA C Programming Guide, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

Now, __launch_bounds__ and maxregcount limit register usage by two different mechanisms.

__launch_bounds__

nvcc decides the number of registers to be used by a __global__ function through balancing the performance and the generality of the kernel launch setup. Saying it differently, such a choice of the number of used registers “guarantees effectiveness” for different numbers of threads per block and of blocks per multiprocessor.

However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches.

In other words:


#define MAX_THREADS_PER_BLOCK 256

#define MIN_BLOCKS_PER_MP     2

__global__ void

__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)

fooKernel(int *inArr, int *outArr)

{

// ... Computation of kernel

}

informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an “optimal” way.

The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.

The limiting mechanism is described in the Programming Guide as follows:

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of  maxThreadsPerBlock threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:

–      If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;

Accordingly, __launch_bounds__ can lead to register spill.

maxrregcount

maxrregcount is a compiler flag that simply hardlimits the number of employed registers to a number set by the user, at variance with __launch_bounds__, by forcing the compiler to rearrange its use of registers. When the compiler can’t stay below the imposed limit, it will simply spill it to L1, L2 and DRAM.

Leave a Reply

Your email address will not be published. Required fields are marked *