Matrix transposition is a very common operation in linear algebra.
From a numerical point of view, it is a memory-bound problem: it involves practically no arithmetic and essentially consists of rearranging the layout of the matrix in memory.
Because of the GPU architecture and the cost of global memory operations, a naive implementation, in which either the reads or the writes to global memory are strided and hence uncoalesced, performs poorly, so matrix transposition requires some care if performance is of interest.
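To make the point about global memory concrete, here is a minimal sketch contrasting a naive kernel with the commonly used shared-memory tiling technique. It is not taken from the code compared in this post; the names transposeNaive, transposeTiled, transposeOnDevice and TILE are hypothetical, and it assumes single precision, row-major storage and no error checking.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge (assumption); blocks must be TILE x TILE

// Naive version: consecutive threads read consecutive elements of `in`, but
// they write `out` with a stride of `rows` elements (uncoalesced writes).
__global__ void transposeNaive(const float* in, float* out, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) out[c * rows + r] = in[r * cols + c];
}

// Tiled version: a TILE x TILE block is staged in shared memory so that both
// the global read and the global write touch consecutive addresses.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding to avoid bank conflicts

    int c = blockIdx.x * TILE + threadIdx.x;  // column in the source
    int r = blockIdx.y * TILE + threadIdx.y;  // row in the source
    if (r < rows && c < cols) tile[threadIdx.y][threadIdx.x] = in[r * cols + c];
    __syncthreads();

    // Swap the block coordinates; threadIdx.x still runs along the fast axis.
    int tc = blockIdx.y * TILE + threadIdx.x;  // column in the output (source row)
    int tr = blockIdx.x * TILE + threadIdx.y;  // row in the output (source column)
    if (tr < cols && tc < rows) out[tr * rows + tc] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: transposes a row-major rows x cols matrix and
// returns the row-major cols x rows result. Error checking is omitted.
std::vector<float> transposeOnDevice(const std::vector<float>& h, int rows, int cols) {
    assert(h.size() == static_cast<size_t>(rows) * cols);
    const size_t bytes = h.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

    std::vector<float> result(h.size());
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling transposeOnDevice on a 2 x 3 matrix stored as {1, 2, 3, 4, 5, 6} should give the 3 x 2 matrix stored as {1, 4, 2, 5, 3, 6}. The two library-based approaches compared in this post spare you from writing and tuning such a kernel by hand.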
Here we compare two ways of performing matrix transposition in CUDA: one using the Thrust library and one using the cuBLAS routine cublas<t>geam.
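For orientation, the two approaches might look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the project mentioned in this post: it assumes single precision (so cublas<t>geam becomes cublasSgeam), row-major storage, and a Thrust version built on thrust::gather with an index-mapping functor. The names TransposeIndex, transposeThrust and transposeCublas are hypothetical.

```cuda
#include <cassert>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Maps linear index k of the transposed (cols x rows, row-major) matrix to
// the linear index of the same element in the rows x cols source matrix.
struct TransposeIndex {
    int rows, cols;
    __host__ __device__ TransposeIndex(int r, int c) : rows(r), cols(c) {}
    __host__ __device__ int operator()(int k) const {
        int i = k / rows;     // row in the transposed matrix = source column
        int j = k % rows;     // column in the transposed matrix = source row
        return j * cols + i;  // position of source element (j, i)
    }
};

// Thrust variant (assumed formulation): gather source elements in transposed order.
void transposeThrust(const thrust::device_vector<float>& src,
                     thrust::device_vector<float>& dst, int rows, int cols) {
    thrust::counting_iterator<int> first(0);
    auto map = thrust::make_transform_iterator(first, TransposeIndex(rows, cols));
    thrust::gather(map, map + rows * cols, src.begin(), dst.begin());
}

// cuBLAS variant: geam computes C = alpha * op(A) + beta * op(B).
// With alpha = 1, beta = 0 and op(A) = A^T it reduces to a transposition.
// cuBLAS is column-major, so the row-major rows x cols source is seen as a
// cols x rows matrix with leading dimension cols; the result C is
// rows x cols in column-major terms, i.e. cols x rows in row-major terms.
cublasStatus_t transposeCublas(cublasHandle_t handle, const float* src,
                               float* dst, int rows, int cols) {
    const float alpha = 1.0f;
    const float beta = 0.0f;
    // B is passed as dst (in-place form with transb = N and ldb = ldc);
    // it does not contribute because beta is zero.
    return cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, rows, cols,
                       &alpha, src, cols,
                       &beta, dst, rows,
                       dst, rows);
}
```

Both functions expect device memory sized rows * cols; transposeCublas additionally needs a handle created with cublasCreate and released with cublasDestroy. For double precision, cublasDgeam would play the same role with double arguments.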
The full code we set up for the comparison can be downloaded as a Visual Studio 2010 project.
Here are the results of the tests performed on a Kepler K20c card:
As you can see, cublas<t>geam clearly outperforms the Thrust-based solution and proves to be a very efficient way to perform matrix transposition in CUDA.