Tricks and Tips: Replicate a vector multiple times using CUDA Thrust

Suppose we have an array of M elements. We want to create a new array of M * N elements in which the M elements of the original array are repeated N times. In other words, if M = 3 and the original array is {1, 2, 3}, we want to end up with {1, 2, 3, 1, 2, 3, ...}. This can be easily done with CUDA Thrust as an application of the expand operator. A fully worked example is available on our GitHub page.
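The index mapping behind the replication can be sketched on the host in plain C++. This is not the Thrust code from the GitHub example; the function name `replicate` and the use of `int` elements are assumptions made for illustration only. The point is that output element i is simply input element i % M.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch (hypothetical helper, not the Thrust implementation):
// output element i is a copy of input element i % M, where M = in.size().
std::vector<int> replicate(const std::vector<int>& in, std::size_t n_times) {
    const std::size_t m = in.size();
    std::vector<int> out(m * n_times);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = in[i % m];  // wrap around the original array
    }
    return out;
}
```

Calling `replicate({1, 2, 3}, 2)` yields {1, 2, 3, 1, 2, 3}. A device version would presumably apply the same i % M gather in parallel.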

Tricks and Tips: Scaling the rows of a matrix with CUDA

Suppose that we want to scale the rows of a matrix, multiplying each row by its own coefficient. Besides writing your own CUDA kernel, there are two possibilities: CUDA Thrust's thrust::transform and cuBLAS's cublas<t>dgmm. A full example is available on our GitHub page. We have tested that code on a Kepler K20c, with these results:

Size          Thrust    cuBLAS
2500 x 1250   0.20 ms   0.25 ms
5000 x 2500   0.77 ms   0.83 ms

The cuBLAS timing excludes the cublasCreate time. Even with this, the CUDA Thrust version seems ...

Tricks and Tips: Reordering matrix rows by key

Ordering an array by key can be achieved very simply in CUDA by using CUDA Thrust's sort_by_key or stable_sort_by_key. But what happens if we want to order the rows of a matrix according to a key or "membership"? For example, consider the following matrix

[ 10 17 64 90 97 27 56 45 ]
[ 33 76 18 60 62 82 63 56 ]
[ 88 99 75 96 36 48 90 68 ]
[ 91 96 24 87 91 36 94 47 ]
[ 37 56 45 81 72 58 63 18 ]

along with the following row keys

3 2 2 4 2

We want to order the matrix rows ac...
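One way to picture row reordering by key is as a stable sort of the row indices followed by a gather of whole rows. The host-side C++ sketch below is not the post's Thrust code. The helper names `row_order_by_key` and `gather_rows`, the ascending key order, and the row-major storage are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: stably sorts the row indices by key.
// Ascending order is an assumption of this sketch.
std::vector<std::size_t> row_order_by_key(const std::vector<int>& keys) {
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return order;
}

// Hypothetical helper: copies the rows of a row-major matrix into the given order.
std::vector<int> gather_rows(const std::vector<int>& matrix, std::size_t ncols,
                             const std::vector<std::size_t>& order) {
    std::vector<int> out(matrix.size());
    for (std::size_t r = 0; r < order.size(); ++r) {
        const auto src = static_cast<std::ptrdiff_t>(order[r] * ncols);
        const auto dst = static_cast<std::ptrdiff_t>(r * ncols);
        std::copy_n(matrix.begin() + src, ncols, out.begin() + dst);
    }
    return out;
}
```

For the keys 3 2 2 4 2, `row_order_by_key` returns the zero-based row order 1, 2, 4, 0, 3: rows with equal keys keep their original relative order because the sort is stable.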

Tricks and Tips: Determining the least element and its position in each matrix column with CUDA Thrust

In this post, we consider two approaches for determining the minimum element along each column of a matrix. The first uses Thrust's reduce_by_key in conjunction with a transform iterator which performs an implicit matrix transposition. The second in effect sorts each row together with the column indices of its elements. Of course, the second approach does more than the first: from it, not only the least but also the second least element can be determi...

Tricks and Tips: Determining the 2 largest elements and their positions in each matrix row with CUDA Thrust

In this post, we consider the problem of determining the 2 largest numbers and their positions in each row of a matrix. This problem can be solved by using CUDA Thrust. The approach produces two iterators and one vector:
1. d_min_indices_1: iterator pointing to the indices of the last element for each row;
2. d_min_indices_2: iterator pointing to the indices of the second to last element for each row;
3. d_matrix: original matrix but with each row ordered in ascending o...

Tricks and Tips: Reduce matrix columns with CUDA

Here we report 4 approaches for matrix column reduction, 3 of them based on CUDA Thrust and 1 based on cublas<t>gemv() with a column of 1's. The CUDA Thrust approaches are analogous to those of our previous post, Reduce matrix rows with CUDA, with an implicit transposition obtained by

thrust::make_permutation_iterator(d_matrix.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), (_1 % Nrows) * Ncols + _1 / Nrows))

The full code is reported on our github...
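To see why the index expression (_1 % Nrows) * Ncols + _1 / Nrows acts as a transposition, it can help to evaluate it on the host. The plain C++ sketch below is not the Thrust code from the post. The function name `column_sums`, the `double` element type, and the row-major storage are assumptions made for illustration. Walking k from 0 to Nrows * Ncols - 1 visits the matrix column by column, so each run of Nrows consecutive values of k belongs to column k / Nrows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch: sums each column of a row-major
// nrows x ncols matrix by visiting it in the transposed order
// src = (k % nrows) * ncols + k / nrows.
std::vector<double> column_sums(const std::vector<double>& a,
                                std::size_t nrows, std::size_t ncols) {
    assert(a.size() == nrows * ncols);
    std::vector<double> sums(ncols, 0.0);
    for (std::size_t k = 0; k < nrows * ncols; ++k) {
        const std::size_t src = (k % nrows) * ncols + k / nrows;  // row-major offset
        sums[k / nrows] += a[src];  // k / nrows is the column index
    }
    return sums;
}
```

For the 2 x 3 matrix {1, 2, 3, 4, 5, 6}, the visiting order is 1, 4, 2, 5, 3, 6 and the column sums are {5, 7, 9}.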

Tricks and Tips: Reduce matrix rows with CUDA

Reducing the rows of a matrix can be done with CUDA Thrust in three ways (they may not be the only ones, but addressing this point is out of scope here). An approach using cuBLAS is also possible.

APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.

APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values ...

1D FFTs of columns and rows of a 3D matrix in CUDA

In this post we explore the possibility of using a single cufftPlanMany to perform 1D FFTs of the columns of a 3D matrix. Transformations performed according to cufftPlanMany, which is called like

cufftPlanMany(&handle, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_C2C, batch);

must obey the Advanced Data Layout. In particular, 1D FFTs are worked out according to the following layout

input[b * idist + x * istride]

where b addresses the b-th signal and istride is the distan...
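The layout formula input[b * idist + x * istride] can be tried out on the host without cuFFT. The plain C++ sketch below only gathers the samples that the b-th 1D transform would read. The function name `extract_signal`, the `float` element type, and the 2D row-major example are assumptions made for illustration, not part of the post.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch: collects the n samples of the b-th signal,
// reading element x from data[b * idist + x * istride].
std::vector<float> extract_signal(const std::vector<float>& data, std::size_t b,
                                  std::size_t n, std::size_t istride,
                                  std::size_t idist) {
    std::vector<float> signal(n);
    for (std::size_t x = 0; x < n; ++x) {
        signal[x] = data.at(b * idist + x * istride);  // bounds-checked access
    }
    return signal;
}
```

For a 2 x 3 row-major matrix {1, 2, 3, 4, 5, 6}, choosing n = 2, istride = 3, idist = 1 picks out the columns (b = 1 gives {2, 5}), while n = 3, istride = 1, idist = 3 picks out the rows (b = 1 gives {4, 5, 6}).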