Reducing the rows of a matrix (i.e., computing one reduction per row, such as a row sum) can be accomplished with CUDA Thrust in three ways (they may not be the only ones, but addressing that point is out of scope here). A fourth approach using cuBLAS is also possible.
APPROACH #1 – reduce_by_key
This is the approach suggested at the Thrust example page. It includes a variant using make_discard_iterator, which avoids storing the output keys ("#1-v2" in the timings below).
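A minimal sketch of this approach, assuming a row-major Nrows x Ncols matrix of floats (names and sizes are illustrative). The keys are generated on the fly with a transform iterator, so element k of the flattened matrix belongs to row k / Ncols; the discard-iterator variant is shown, since the output keys are not needed:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <cstdio>

using namespace thrust::placeholders;

int main() {
    const int Nrows = 4, Ncols = 3;
    thrust::device_vector<float> d_matrix(Nrows * Ncols, 1.f);  // all ones, for the demo
    thrust::device_vector<float> d_row_sums(Nrows);

    // Key of flattened element k is its row index k / Ncols, generated lazily.
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), _1 / Ncols),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), _1 / Ncols) + Nrows * Ncols,
        d_matrix.begin(),
        thrust::make_discard_iterator(),   // the "#1-v2" variant: keys output is discarded
        d_row_sums.begin());

    for (int i = 0; i < Nrows; ++i)
        printf("row %d sum = %f\n", i, (float)d_row_sums[i]);
    return 0;
}
```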
APPROACH #2 – transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
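The idea, sketched below under the same assumptions (row-major storage, illustrative names), is to launch one thrust::transform over the row indices, with a functor that serially sums one row from a raw device pointer. This explains the timing pattern later on: parallelism comes only from the number of rows, so it shines when rows are many and short:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>

// One functor invocation per row; each invocation sums Ncols elements serially.
struct row_sum {
    const float* data;
    int Ncols;
    row_sum(const float* d, int c) : data(d), Ncols(c) {}
    __host__ __device__ float operator()(int row) const {
        float s = 0.f;
        for (int j = 0; j < Ncols; ++j) s += data[row * Ncols + j];
        return s;
    }
};

int main() {
    const int Nrows = 4, Ncols = 3;
    thrust::device_vector<float> d_matrix(Nrows * Ncols, 1.f);
    thrust::device_vector<float> d_row_sums(Nrows);

    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(Nrows),
                      d_row_sums.begin(),
                      row_sum(thrust::raw_pointer_cast(d_matrix.data()), Ncols));

    for (int i = 0; i < Nrows; ++i)
        printf("row %d sum = %f\n", i, (float)d_row_sums[i]);
    return 0;
}
```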
APPROACH #3 – inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
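A sketch of this approach under the same assumptions: a segmented inclusive scan is run with the row index as key, after which the last element of each row's scan is that row's total. A permutation iterator picks out those last elements at positions (i + 1) * Ncols - 1:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <cstdio>

using namespace thrust::placeholders;

int main() {
    const int Nrows = 4, Ncols = 3;
    thrust::device_vector<float> d_matrix(Nrows * Ncols, 1.f);
    thrust::device_vector<float> d_scan(Nrows * Ncols);
    thrust::device_vector<float> d_row_sums(Nrows);

    // Segmented scan: the running sum resets at every row boundary.
    thrust::inclusive_scan_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), _1 / Ncols),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), _1 / Ncols) + Nrows * Ncols,
        d_matrix.begin(),
        d_scan.begin());

    // The last scanned element of each row is the row sum.
    thrust::copy(
        thrust::make_permutation_iterator(d_scan.begin(),
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), (_1 + 1) * Ncols - 1)),
        thrust::make_permutation_iterator(d_scan.begin(),
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), (_1 + 1) * Ncols - 1)) + Nrows,
        d_row_sums.begin());

    for (int i = 0; i < Nrows; ++i)
        printf("row %d sum = %f\n", i, (float)d_row_sums[i]);
    return 0;
}
```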
APPROACH #4 – cublas<t>gemv
It uses cuBLAS gemv to multiply the matrix by a column vector of ones, which yields the vector of row sums.
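A sketch of this approach, again assuming a row-major Nrows x Ncols matrix of floats. Since cuBLAS expects column-major storage, a row-major Nrows x Ncols matrix is reinterpreted as a column-major Ncols x Nrows matrix, so the transpose operation is requested with m = Ncols, n = Nrows:

```cuda
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <cstdio>

int main() {
    const int Nrows = 4, Ncols = 3;
    thrust::device_vector<float> d_matrix(Nrows * Ncols, 1.f);  // row-major
    thrust::device_vector<float> d_ones(Ncols, 1.f);            // vector of ones
    thrust::device_vector<float> d_row_sums(Nrows);

    cublasHandle_t handle;
    cublasCreate(&handle);  // this is the "plan" whose creation cost dominates the #4 timings

    const float alpha = 1.f, beta = 0.f;
    // y = A^T * x in cuBLAS terms, where the column-major A (Ncols x Nrows, lda = Ncols)
    // is the row-major matrix reinterpreted; A^T * ones gives the row sums.
    cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha,
                thrust::raw_pointer_cast(d_matrix.data()), Ncols,
                thrust::raw_pointer_cast(d_ones.data()), 1, &beta,
                thrust::raw_pointer_cast(d_row_sums.data()), 1);

    for (int i = 0; i < Nrows; ++i)
        printf("row %d sum = %f\n", i, (float)d_row_sums[i]);

    cublasDestroy(handle);
    return 0;
}
```

Timing the cublasSgemv call alone (after the handle already exists) corresponds to the "#4 (no plan)" column in the results below.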
THE FULL CODE
The full code is available on our github page.
TIMING RESULTS (tested on a Kepler K20c)
| Matrix size | #1   | #1-v2 | #2    | #3   | #4    | #4 (no plan) |
|-------------|------|-------|-------|------|-------|--------------|
| 100 x 100   | 0.63 | 1.00  | 0.10  | 0.18 | 139.4 | 0.098        |
| 1000 x 1000 | 1.25 | 1.12  | 3.25  | 1.04 | 101.3 | 0.12         |
| 5000 x 5000 | 8.38 | 15.3  | 16.05 | 13.8 | 111.3 | 1.14         |
| 100 x 5000  | 1.25 | 1.52  | 2.92  | 1.75 | 101.2 | 0.40         |
| 5000 x 100  | 1.35 | 1.99  | 0.37  | 1.74 | 139.2 | 0.14         |
Approaches #1 and #3 outperform approach #2, except when the number of columns is small (the 100 x 100 and 5000 x 100 cases), where approach #2 is the fastest Thrust option. The best approach overall, however, is approach #4: it is significantly faster and more convenient than the others, provided that the time needed to create the cuBLAS plan can be amortized over the computation; otherwise that one-time cost dominates, as the "#4" column shows.