Reduction examples in CUDA typically deal with a single large array. In some cases, however, the task is to sum a very large number of small arrays.
For this case, warp-level reduction is a natural fit: each warp of 32 threads can reduce one small array on its own, with no block-wide synchronization.
Rather than coding your own warp reduction, a good option is to use CUB, in particular its WarpReduce primitive.
A fully worked example is available on our GitHub page.
In that example, an array of length N is created, and each element of the result is the sum of 32 consecutive elements:
result[0] = data[0] + ... + data[31]; result[1] = data[32] + ... + data[63]; ....
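
To make the idea concrete, here is a minimal sketch of how such a sum over groups of 32 elements could look with cub::WarpReduce. It is not the code from the GitHub example: the names sumGroupsKernel, sumGroupsOf32 and kWarpsPerBlock, the choice of float data and the block size of 128 threads are all assumptions made for illustration, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize      = 32;  // number of elements summed per result
constexpr int kWarpsPerBlock = 4;   // assumed value: 128 threads per block

// Each warp sums 32 consecutive elements of data and writes one result.
__global__ void sumGroupsKernel(const float* data, float* result, int n)
{
    using WarpReduce = cub::WarpReduce<float>;

    // CUB needs one TempStorage object per warp in the block.
    __shared__ typename WarpReduce::TempStorage temp[kWarpsPerBlock];

    const int tid         = blockIdx.x * blockDim.x + threadIdx.x;
    const int warpInBlock = threadIdx.x / kWarpSize;

    // Out-of-range threads contribute 0 so that every lane takes part.
    const float value = (tid < n) ? data[tid] : 0.0f;

    // The returned aggregate is only valid in lane 0 of each warp.
    const float sum = WarpReduce(temp[warpInBlock]).Sum(value);

    if (threadIdx.x % kWarpSize == 0 && tid < n)
        result[tid / kWarpSize] = sum;
}

// Hypothetical host wrapper: copies the data to the device, launches the
// kernel and returns one sum per group of 32 elements.
std::vector<float> sumGroupsOf32(const std::vector<float>& h_data)
{
    const int n         = static_cast<int>(h_data.size());
    const int numGroups = (n + kWarpSize - 1) / kWarpSize;
    std::vector<float> h_result(numGroups);
    if (n == 0) return h_result;

    float* d_data   = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, numGroups * sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = kWarpSize * kWarpsPerBlock;
    const int blocks  = (n + threads - 1) / threads;
    sumGroupsKernel<<<blocks, threads>>>(d_data, d_result, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_result.data(), d_result, numGroups * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Calling sumGroupsOf32 on a host vector of length N returns a vector with one entry per group of 32 elements. If N is not a multiple of 32, the last entry in this sketch is the sum of the remaining elements only.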