In many CUDA reduction examples, the basic idea is to perform a block reduction first and then reduce the partial results from all the blocks.
It is good to know that CUB provides a block reduction primitive, called BlockReduce.
On our GitHub website a fully worked example is available.
In that example, an array of length N is created and each entry of the result is the sum of 32 consecutive elements, where 32 is the block size:
result[0] = data[0] + ... + data[31]; result[1] = data[32] + ... + data[63]; ....
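
For readers who want to see the shape of such a kernel, here is a minimal sketch of a per-block sum with cub::BlockReduce. It is not the GitHub example mentioned above. The names blockSumKernel, blockSums and BLOCK_SIZE are made up for illustration, the element type is assumed to be int, and CUDA error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cub/cub.cuh>

// Assumption: one thread per element and 32 threads per block, as in the text.
constexpr int BLOCK_SIZE = 32;

// Each block sums BLOCK_SIZE consecutive elements of data
// and writes the total to result[blockIdx.x].
__global__ void blockSumKernel(const int *data, int *result, int n)
{
    // Specialize BlockReduce for int and a 1D block of BLOCK_SIZE threads.
    using BlockReduce = cub::BlockReduce<int, BLOCK_SIZE>;

    // Shared memory scratch space required by BlockReduce.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array contribute 0 to the sum.
    int value = (idx < n) ? data[idx] : 0;

    // The returned aggregate is only valid in thread 0 of the block.
    int sum = BlockReduce(temp_storage).Sum(value);

    if (threadIdx.x == 0) result[blockIdx.x] = sum;
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<int> blockSums(const std::vector<int> &h_data)
{
    const int n = static_cast<int>(h_data.size());
    const int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<int> h_result(numBlocks);
    if (n == 0) return h_result;

    int *d_data = nullptr;
    int *d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, numBlocks * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, BLOCK_SIZE>>>(d_data, d_result, n);

    // This copy runs after the kernel has finished on the default stream.
    cudaMemcpy(h_result.data(), d_result, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    // 128 ones split into 4 blocks of 32 elements.
    std::vector<int> data(128, 1);
    std::vector<int> result = blockSums(data);
    for (size_t i = 0; i < result.size(); ++i)
        printf("result[%zu] = %d\n", i, result[i]);
    return 0;
}
```

The kernel only produces the per-block partial sums. To get a single total, those partial sums would still have to be reduced, for example with a second kernel launch or on the host.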