One problem of interest is that of extending the approach in "Sorting many small arrays by key in CUDA" to the case when multiple arrays must be ordered according to the same key.
Unfortunately, it is not possible to use cub::BlockRadixSort by "packing" the arrays with zip iterators and tuples. Accordingly, we have exploited a helper-index approach.
A fully worked example is reported on our GitHub website.
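The helper-index idea can be sketched as follows. This is a minimal sketch, not the full example from the repository: cub::BlockRadixSort sorts (key, index) pairs, and the permuted indices are then used to gather every payload array. The kernel name, template parameters, and the choice of two payload arrays are illustrative assumptions.

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: each block sorts one segment of keys together with a
// helper index, then gathers two payload arrays through the sorted indices.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void sortTwoArraysByKey(float *d_keys,
                                   const int    *d_payload1_in, int    *d_payload1_out,
                                   const double *d_payload2_in, double *d_payload2_out)
{
    using BlockRadixSortT = cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    const int offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    float thread_keys[ITEMS_PER_THREAD];
    int   thread_idx [ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        thread_keys[i] = d_keys[offset + threadIdx.x * ITEMS_PER_THREAD + i];
        thread_idx [i] = threadIdx.x * ITEMS_PER_THREAD + i;   // helper index
    }

    // Sort (key, index) pairs; the indices now encode the permutation.
    BlockRadixSortT(temp_storage).Sort(thread_keys, thread_idx);

    // Gather every payload array through the same permuted helper indices.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int src = offset + thread_idx[i];
        int dst = offset + threadIdx.x * ITEMS_PER_THREAD + i;
        d_keys[dst]         = thread_keys[i];
        d_payload1_out[dst] = d_payload1_in[src];
        d_payload2_out[dst] = d_payload2_in[src];
    }
}
```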


# Sorting many small arrays by key in CUDA

In many applications, the problem of sorting many small arrays by key in CUDA arises.
CUB offers a possible solution to this problem. On our GitHub website, we report an example that can be reused for this purpose.
This example was constructed around the previously posted "Sorting many small arrays in CUDA".
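A minimal sketch of the approach, assuming one small array per block and one (key, value) pair per thread; the kernel name and key/value types are illustrative assumptions, not the repository code:

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: block blockIdx.x sorts the N-element array starting at
// offset blockIdx.x * N, moving the float values along with the int keys.
template <int N>   // N = size of each small array = threads per block
__global__ void sortSmallArraysByKey(int *d_keys, float *d_values)
{
    using BlockRadixSortT = cub::BlockRadixSort<int, N, 1, float>;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    int   key[1] = { d_keys  [blockIdx.x * N + threadIdx.x] };
    float val[1] = { d_values[blockIdx.x * N + threadIdx.x] };

    // Key-value sort within the block.
    BlockRadixSortT(temp_storage).Sort(key, val);

    d_keys  [blockIdx.x * N + threadIdx.x] = key[0];
    d_values[blockIdx.x * N + threadIdx.x] = val[0];
}
```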

# Sorting many small arrays in CUDA

In many applications, the problem of sorting many small arrays in CUDA arises.
CUB offers a possible solution to this problem. On our GitHub website, we report an example that can be reused for this purpose.
The idea is to assign the small arrays to be sorted to different thread blocks and then to use cub::BlockRadixSort to sort each array. Two versions are provided, one loading and one not loading the small arrays into shared memory.
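The variant without an explicit shared-memory staging step can be sketched as follows (a minimal sketch under assumed names and template parameters; the full example is in the repository):

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: one block per small array, with each thread holding
// ITEMS_PER_THREAD consecutive elements in registers.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void sortManySmallArrays(float *d_data)
{
    using BlockRadixSortT = cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    const int base = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD
                   + threadIdx.x * ITEMS_PER_THREAD;

    float items[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) items[i] = d_data[base + i];

    // Sorts the whole block's elements; results come back in blocked order.
    BlockRadixSortT(temp_storage).Sort(items);

    for (int i = 0; i < ITEMS_PER_THREAD; ++i) d_data[base + i] = items[i];
}
```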

# Cost functional calculation for local and global optimization in CUDA

Many cost functionals are expressed in the form of summations of a certain number of terms. Examples are:
- the Sphere function;
- the Rosenbrock function;
- the Styblinski-Tang function.
In all those cases, the evaluation of the cost functional can be performed by a reduction or, better, a transformation followed by a reduction.
Typically, large numbers of parameters to be optimized are managed by local optimization.
In this case, CUDA Thrust has thrust::tr...
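The transformation-plus-reduction pattern can be sketched with thrust::transform_reduce. This is a minimal sketch assuming the Sphere function, f(x) = Σᵢ xᵢ², as the cost functional; the functor and function names are illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// Transformation: square each parameter.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Sphere cost functional: squaring (transformation) followed by
// summation (reduction), fused in a single thrust::transform_reduce call.
float sphereCost(const thrust::device_vector<float> &x)
{
    return thrust::transform_reduce(x.begin(), x.end(),
                                    square(), 0.0f, thrust::plus<float>());
}
```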

# Residual calculation with CUB

In this post, we deal with the calculation of the residual between two vectors. In other words, we want to calculate the residual defined by the following pseudocode:
residual = 0;
forall i : residual += (oldvector[i] - newvector[i])^2
This operation can be conveniently performed by CUDA CUB.
Unlike Thrust, CUB leaves performance-critical parameters, such as the choice of the specific reduction algorithm to be used and the degree of concurrency, unbound and selectable by the user.
These param...
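One way to realize the pseudocode above with CUB is to generate the squared-difference terms on the fly and sum them with cub::DeviceReduce::Sum. This is a minimal sketch with CUB's tuning parameters left at their defaults; the function and functor names are assumptions:

```cuda
#include <cub/cub.cuh>

// Produces the i-th squared-difference term on demand.
struct SquaredDiff {
    const float *oldv, *newv;
    __host__ __device__ float operator()(int i) const {
        float d = oldv[i] - newv[i];
        return d * d;
    }
};

float residual(const float *d_old, const float *d_new, int N)
{
    cub::CountingInputIterator<int> cnt(0);
    cub::TransformInputIterator<float, SquaredDiff, cub::CountingInputIterator<int>>
        terms(cnt, SquaredDiff{d_old, d_new});

    float *d_result;  cudaMalloc(&d_result, sizeof(float));

    // First call only queries the temporary storage size; second call reduces.
    void *d_temp = nullptr;  size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, terms, d_result, N);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, terms, d_result, N);

    float h_result;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);  cudaFree(d_result);
    return h_result;
}
```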

# Block reduction in CUDA

In many CUDA reduction examples, the basic idea is to perform a block reduction first and then to reduce the partial results from all the blocks.
It is good to know that CUB provides a block reduction primitive, called BlockReduce.
A fully worked example is available on our GitHub website.
In that example, an array of length N is created and the result is the sum of 32 consecutive elements, 32 being the block size:
result[0] = data[0] + ... + data[31];
result[1] = data[32] + ... + data[63];
....
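The core of such an example can be sketched as follows (a minimal sketch, not the repository code; the kernel name is an assumption):

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: each 32-thread block sums its 32 consecutive elements
// with cub::BlockReduce; thread 0 of each block writes the partial sum.
__global__ void blockSum(const float *data, float *result)
{
    using BlockReduceT = cub::BlockReduce<float, 32>;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    float val = data[blockIdx.x * 32 + threadIdx.x];

    // The returned aggregate is only valid in thread 0.
    float sum = BlockReduceT(temp_storage).Sum(val);

    if (threadIdx.x == 0) result[blockIdx.x] = sum;
}
```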

# Warp reduction in CUDA

Reduction examples in CUDA typically refer to the case of large arrays. However, in some cases there is the need to sum a very large number of small arrays.
For this case, warp reduction offers many advantages.
Instead of coding your own warp reduction, a very good option is to use CUB primitives, in particular CUB's WarpReduce primitive.
On our GitHub website a fully worked example is available.
In that example, an array of length N is created and the result is the sum of 32 consecutiv...
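The warp-level counterpart can be sketched as follows (a minimal sketch, not the repository code; the kernel name and the choice of a 128-thread block are assumptions):

```cuda
#include <cub/cub.cuh>

// Hypothetical sketch: each 32-thread warp sums one 32-element small array;
// lane 0 of each warp writes that array's sum. Blocks hold 4 warps (128 threads).
__global__ void warpSums(const float *data, float *result)
{
    using WarpReduceT = cub::WarpReduce<float>;
    __shared__ typename WarpReduceT::TempStorage temp_storage[4];

    int warp_id     = threadIdx.x / 32;
    int lane        = threadIdx.x % 32;
    int global_warp = blockIdx.x * 4 + warp_id;

    float val = data[global_warp * 32 + lane];

    // The returned aggregate is only valid in lane 0 of each warp.
    float sum = WarpReduceT(temp_storage[warp_id]).Sum(val);

    if (lane == 0) result[global_warp] = sum;
}
```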

# Tricks and Tips: Calling CUDA Thrust primitives from within a kernel

Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.
If one wants parallel execution within a thread block, then using CUB must be considered, which provides reduction routines that can be called from within a thread block, provided that the relevant card supports dynamic parallelism.
Here is the example with Thrust:
#include <...
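A minimal sketch of the idea (an illustrative kernel of our own, not the full example cut off above): each thread sequentially reduces its own row of a matrix by passing thrust::seq to thrust::reduce.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>

// Hypothetical sketch: one row sum per thread, computed sequentially inside
// the thread via the thrust::seq execution policy.
__global__ void rowSums(const float *matrix, float *sums, int nrows, int ncols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;

    sums[row] = thrust::reduce(thrust::seq,
                               matrix + row * ncols,
                               matrix + (row + 1) * ncols);
}
```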

# Sorting with CUDA libraries

In this post, we will see two possibilities to perform sorting in CUDA using libraries. The first is using Thrust, while the second is using CUB, which has several "device-wide" primitives.
Thrust uses radix sort for fundamental data types with the already defined comparison operators; see Thrust: A Productivity-Oriented Library for CUDA.
In contrast, when one needs to sort objects by defining customized struct comparison operators, Thrust will switch to merge sort. Radix-sort is remar...
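The two library interfaces can be contrasted on the same keys as follows (a minimal sketch; the wrapper function name is an assumption):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cub/cub.cuh>

// Hypothetical sketch: sorting N float keys with Thrust and with CUB.
void sortBothWays(float *d_keys_in, float *d_keys_out, int N)
{
    // Thrust: a one-liner that selects radix sort for fundamental types.
    thrust::sort(thrust::device, d_keys_in, d_keys_in + N);

    // CUB: explicit temporary-storage handling, out-of-place radix sort.
    // First call only queries the temporary storage size; second call sorts.
    void *d_temp = nullptr;  size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out, N);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys_in, d_keys_out, N);
    cudaFree(d_temp);
}
```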
