Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread).
Starting from Thrust 1.8.1, CUDA Thrust primitives can also be called from within a single CUDA thread with the thrust::device execution policy, in which case they run in parallel by exploiting CUDA dynamic parallelism.
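As a minimal sketch of the thrust::seq case (not the code from the GitHub example), each CUDA thread below reduces one row of a row-major matrix by calling thrust::reduce sequentially. The names reduce_rows_seq and row_sums_seq, the block size of 128, and the 2 x 3 test matrix are assumptions made for illustration.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

// One CUDA thread per row: each thread sums its own row sequentially.
__global__ void reduce_rows_seq(const float *matrix, float *row_sums,
                                int n_rows, int n_cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        const float *first = matrix + row * n_cols;
        // thrust::seq keeps the whole reduction inside the calling thread.
        row_sums[row] = thrust::reduce(thrust::seq, first, first + n_cols, 0.0f);
    }
}

// Hypothetical host wrapper: copies the matrix to the device, launches the
// kernel, and returns the per-row sums.
std::vector<float> row_sums_seq(const std::vector<float> &h_matrix,
                                int n_rows, int n_cols)
{
    thrust::device_vector<float> d_matrix(h_matrix.begin(), h_matrix.end());
    thrust::device_vector<float> d_sums(n_rows);

    int threads = 128;  // assumed block size
    int blocks = (n_rows + threads - 1) / threads;
    reduce_rows_seq<<<blocks, threads>>>(
        thrust::raw_pointer_cast(d_matrix.data()),
        thrust::raw_pointer_cast(d_sums.data()), n_rows, n_cols);
    cudaDeviceSynchronize();

    std::vector<float> h_sums(n_rows);
    thrust::copy(d_sums.begin(), d_sums.end(), h_sums.begin());
    return h_sums;
}

int main()
{
    // 2 x 3 matrix stored row by row.
    std::vector<float> h_matrix = {1, 2, 3, 4, 5, 6};
    std::vector<float> sums = row_sums_seq(h_matrix, 2, 3);
    for (size_t i = 0; i < sums.size(); ++i)
        printf("row %zu sum = %f\n", i, sums[i]);
    return 0;
}
```

For this input the two row sums are 6 and 15. Replacing thrust::seq with thrust::device in the kernel gives the second variant described above.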
An example of both is available on our GitHub page.
The example reduces the rows of a matrix, as in the post "Reduce matrix rows with CUDA", but does so differently from that post: it calls CUDA Thrust primitives directly from user-written kernels.
The example also compares the performance of the same operations under the two execution policies, thrust::seq and thrust::device.
The graphs below show the difference in performance.
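A comparison of this kind could be sketched as follows, timing one kernel per policy with CUDA events. This is an illustration under assumptions, not the benchmark from the GitHub example: the kernel names, the helper time_kernel, the 64 x 4096 matrix, and the block size are all hypothetical. The thrust::device kernel is assumed to be compiled with relocatable device code (for example nvcc -arch=sm_35 -rdc=true ... -lcudadevrt) on a device of compute capability 3.5 or higher. As far as I know, when dynamic parallelism is not available in the Thrust build in use, the call falls back to sequential execution in the calling thread.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <cstdio>

// Each thread reduces one row sequentially.
__global__ void rows_seq(const float *m, float *out, int n_rows, int n_cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows)
        out[row] = thrust::reduce(thrust::seq, m + row * n_cols,
                                  m + (row + 1) * n_cols, 0.0f);
}

// Each thread asks Thrust for a parallel reduction of its row
// (child kernels when dynamic parallelism is enabled).
__global__ void rows_device(const float *m, float *out, int n_rows, int n_cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows)
        out[row] = thrust::reduce(thrust::device, m + row * n_cols,
                                  m + (row + 1) * n_cols, 0.0f);
}

// Hypothetical helper: returns the elapsed time of one launch in milliseconds.
template <typename Kernel>
float time_kernel(Kernel kernel, const float *m, float *out,
                  int n_rows, int n_cols)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 128;  // assumed block size
    int blocks = (n_rows + threads - 1) / threads;

    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(m, out, n_rows, n_cols);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n_rows = 64, n_cols = 4096;  // assumed problem size
    thrust::device_vector<float> d_m(n_rows * n_cols, 1.0f);
    thrust::device_vector<float> d_out(n_rows);
    const float *m = thrust::raw_pointer_cast(d_m.data());
    float *out = thrust::raw_pointer_cast(d_out.data());

    printf("thrust::seq    : %f ms\n", time_kernel(rows_seq, m, out, n_rows, n_cols));
    printf("thrust::device : %f ms\n", time_kernel(rows_device, m, out, n_rows, n_cols));
    return 0;
}
```

A single launch includes warm-up overhead, so a fairer measurement would average several launches per policy and sweep the matrix size.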
Performance was evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.