CUDA Texture memory fetches vs global memory fetches: the complex case

In the days of compute capability 1.x, memory fetches could benefit from the texture cache to speed up accesses.
Since compute capability 2.0, global memory fetches can benefit from the L1 and L2 caches, so texture memory has lost some of its relative importance [1].
The question naturally arises whether texture memory is still worth using on newer GPU generations.

The answer is highly application-dependent.
Below, we focus on one particular test case, 1D linear interpolation of complex-valued functions, and compare texture fetches against global memory fetches for that purpose.
For the sake of completeness, we also include linear interpolation performed by texture filtering.
If you want to run your own comparisons, the full Visual Studio 2010 code used in the tests is available on the downloads page.
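For readers who want a reference point, the operation being benchmarked can be sketched on the host as follows. This is a minimal sketch, not the code from the downloads page: the names Complex2, HOST_DEVICE and interpolate_complex are hypothetical, the samples are assumed to sit at the integer positions 0, 1, ..., M-1, and out-of-range positions are assumed to be clamped to the end samples.

```cpp
// Lets the same source build as plain C++ or inside a .cu file with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for CUDA's float2: x = real part, y = imaginary part.
struct alignas(8) Complex2 {
    float x;
    float y;
};

// Linearly interpolates M >= 1 complex samples, assumed to sit at positions
// 0, 1, ..., M-1, at the fractional position pos.
// Positions outside [0, M-1] are clamped to the nearest end sample
// (an assumption of this sketch).
HOST_DEVICE inline Complex2 interpolate_complex(const Complex2* data, int M, float pos)
{
    if (pos <= 0.0f) return data[0];
    if (pos >= static_cast<float>(M - 1)) return data[M - 1];

    // pos is strictly positive here, so truncation is the same as floor.
    const int i = static_cast<int>(pos);
    const float a = pos - static_cast<float>(i);  // weight of the right neighbour

    // Real and imaginary parts are interpolated independently.
    Complex2 r;
    r.x = (1.0f - a) * data[i].x + a * data[i + 1].x;
    r.y = (1.0f - a) * data[i].y + a * data[i + 1].y;
    return r;
}
```

A CPU loop calling interpolate_complex once per output point gives values against which GPU results can be checked. Results obtained through texture filtering may differ slightly, since the hardware computes the interpolation weight at reduced precision.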

We compare four kernels: two use global memory and two use texture memory. Within each pair, the kernels differ in how the complex values are accessed (one float2 fetch or two float fetches).
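The difference between the two access styles can be illustrated with a small sketch. It rests on assumptions rather than on the code from the downloads page: Complex2, HOST_DEVICE, fetch_as_float2 and fetch_as_two_floats are hypothetical names, and the "two floats" variant is assumed to read an interleaved buffer (re0, im0, re1, im1, ...) through a float pointer.

```cpp
// Lets the same source build as plain C++ or inside a .cu file with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for CUDA's float2, 8 bytes with 8-byte alignment.
struct alignas(8) Complex2 {
    float x;  // real part
    float y;  // imaginary part
};

// Variant A: the complex sample is read as one 8-byte element.
HOST_DEVICE inline Complex2 fetch_as_float2(const Complex2* data, int i)
{
    return data[i];
}

// Variant B: the same sample is read as two separate 4-byte elements
// from an interleaved float buffer (assumed layout of this sketch).
HOST_DEVICE inline Complex2 fetch_as_two_floats(const float* data, int i)
{
    Complex2 r;
    r.x = data[2 * i];      // real part
    r.y = data[2 * i + 1];  // imaginary part
    return r;
}
```

Both functions return the same value for the same sample. When built with nvcc they can be called from a kernel, where what changes is the number and width of the memory transactions each thread issues.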

We have considered 4 different GPUs, namely,

  1. GeForce GT210 (cc 1.2)
  2. GeForce GT540M (cc 2.1)
  3. Tesla C2050 (cc 2.0)
  4. Kepler K20c (cc 3.5)

The results are reported in the figures below:

[Figure: GeForce GT210 (cc 1.2)]

[Figure: GeForce GT540M (cc 2.1)]

[Figure: Tesla C2050 (cc 2.0)]

[Figure: Kepler K20c (cc 3.5)]

As can be seen, on older compute capabilities using textures as a cache outperforms global memory, while the two solutions are roughly equivalent on the newest architectures.
Of course, this example is not exhaustive, and in practice there may be other cases in which one or the other is preferable for a particular application.

Note: due to the limits on memory allocation by cudaMallocArray, you will not be able to use texture filtering for arrays whose length is a power of 2 larger than 8192*8.
If you want to run the provided code for lengths from 8192*16 upwards, just comment out the following lines:

cudaArray* data_d_cudaArray = NULL;
gpuErrchk(cudaMallocArray(&data_d_cudaArray, &data_d_texture_filtering.channelDesc, M, 1));
gpuErrchk(cudaMemcpyToArray(data_d_cudaArray, 0, 0, data, sizeof(float2)*M, cudaMemcpyHostToDevice));

for (int k=0; k<Nit; k++) linear_interpolation_function_GPU_texture_filtering(result_d_texture_filtering, data_d_cudaArray, x_in_d, x_out_d, M, N);

 

[1] S. Cook, “CUDA Programming”, Morgan Kaufmann
