Block reduction in CUDA

In many CUDA reduction examples, the basic idea is to reduce within each block first and then reduce the partial results from all the blocks. CUB provides a block-level reduction primitive for the first step, called BlockReduce. A fully worked example is available on our GitHub website. In that example, an array of length N is created and each result is the sum of 32 consecutive elements, where 32 is the block size: result[0] = data[0] + ... + data[31]; result[1] = data[32] + ... + data[63]; ....
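To make the per-block sum concrete, here is a minimal sketch of how cub::BlockReduce could be used for it. It is not the code from the GitHub example. The names blockSumKernel, blockSums and BLOCK_SIZE are assumptions introduced for illustration, as is the choice of int as the element type. Error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_SIZE = 32;  // threads per block, matching the block size in the text

// Hypothetical kernel: each block sums BLOCK_SIZE consecutive elements of data
// and writes the total to result[blockIdx.x].
__global__ void blockSumKernel(const int *data, int *result, int n)
{
    // Specialize BlockReduce for int and a 1D block of BLOCK_SIZE threads.
    using BlockReduce = cub::BlockReduce<int, BLOCK_SIZE>;

    // Shared memory scratch space required by BlockReduce.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array contribute 0 to the sum.
    int value = (idx < n) ? data[idx] : 0;

    // The returned aggregate is only valid in thread 0 of the block.
    int sum = BlockReduce(temp_storage).Sum(value);

    if (threadIdx.x == 0) result[blockIdx.x] = sum;
}

// Hypothetical host helper: copies the input to the device, launches one block
// per BLOCK_SIZE elements, and returns one partial sum per block.
std::vector<int> blockSums(const std::vector<int> &h_data)
{
    const int n = static_cast<int>(h_data.size());
    const int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<int> h_result(numBlocks);
    if (n == 0) return h_result;

    int *d_data = nullptr;
    int *d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, numBlocks * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, BLOCK_SIZE>>>(d_data, d_result, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_result.data(), d_result, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Calling blockSums on an array of length N returns one partial sum per block. Reducing those partial sums, for instance with a second kernel launch or on the host, would complete the two-stage reduction described above.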