Customized Stream Compaction

Stream compaction consists of removing from a collection the elements that do not satisfy a given predicate. For example, given the predicate p(x) = x > 5, the integer array A = {6, 3, 2, 11, 4, 5, 3, 7, 5, 77, 94, 0} is compacted to B = {6, 11, 7, 77, 94}.

The general idea behind parallel stream compaction is that each computational thread is assigned a different element of the array to be compacted. Each thread must decide whether to write its element to the output array depending on whether the element satisfies the predicate.
The main difficulty of stream compaction is thus letting each thread know at which position of the output array its element must be written.

Here, we present the parallel implementation approach of [1,2]. The fully implemented algorithm is available on our GitHub website and consists of three steps:

1. Step #1. Let P be the number of launched threads and N, with N > P, the size of the vector to be compacted. The input vector is divided into sub-vectors of size S equal to the block size.
The __syncthreads_count(pred) block intrinsic is exploited, which counts the number of threads in a block satisfying the predicate pred.
As a result of the first step, the array d_BlockCounts, which has one element per block (size N/S), contains for each block the number of its elements meeting the predicate pred.

2. Step #2. An exclusive scan operation is performed on the array d_BlockCounts.
As a result of the second step, each thread knows how many elements of the previous blocks satisfy the predicate and are therefore written to the output. Accordingly, it knows the position at which its own element must be written, up to an offset relative to its own block.

3. Step #3. Each thread computes the mentioned intra-block offset using warp intrinsic functions and finally writes to the output array. It should be noted that the execution of step #3 depends on warp scheduling. As a consequence, the order of the elements in the output array does not necessarily reflect the order of the elements in the input array.

Of the three steps above, the second is performed by CUDA Thrust’s exclusive_scan primitive and is computationally significantly less demanding than the other two.

For an array of 2097152 elements, the described approach executes in 0.38 ms on an NVIDIA GTX 960 card, against 1.0 ms for CUDA Thrust's copy_if. The described approach appears to be faster for two reasons:
1) it is specifically tailored to cards supporting warp intrinsics;
2) it does not guarantee the ordering of the output.

It should also be noticed that we have tested the approach against the code available at . Although the latter code is arranged in a single kernel call (it does not employ any CUDA Thrust primitive), its performance is no better than that of the three-kernel version.

[1] M. Billeter, O. Olsson, U. Assarsson, "Efficient stream compaction on wide SIMD many-core architectures," Proc. of the Conf. on High Performance Graphics, New Orleans, LA, Aug. 1-3, 2009, pp. 159-166.
[2] D.M. Hughes, I.S. Lim, M.W. Jones, A. Knoll, B. Spencer, "InK-Compact: in-kernel stream compaction and its application to multi-kernel data visualization on General-Purpose GPUs," Computer Graphics Forum, vol. 32, no. 6, pp. 178-188, 2013.
