Concurrency in CUDA multi-GPU executions

Concurrent execution on multi-GPU systems is appealing because, for embarrassingly parallel problems, it can ideally reduce the execution time almost linearly with the number of GPUs.

We ran some experiments on concurrent execution on a cluster of 4 Kepler K20c GPUs. We considered 8 test cases; the code and the profiler timeline for each are reported below.

Test case #1 – “Breadth-first” approach – synchronous copy

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test1.cu

Profiler timeline (image: MultiGPU_Test1)

As can be seen, using cudaMemcpy does not allow the copies to run concurrently, but the kernel executions do overlap.
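The breadth-first pattern with synchronous copies can be sketched as follows. This is a minimal illustration and not the code in the linked repository: the kernel `scaleKernel`, the `CUDA_CHECK` macro, the pageable `std::vector` host buffers and the problem size are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);

    std::vector<std::vector<double>> h(numGPUs, std::vector<double>(n, 1.0));
    std::vector<double *> d(numGPUs, nullptr);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
    }

    // Pass 1: host-to-device copies, one GPU after the other (blocking calls).
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpy(d[k], h[k].data(), bytes, cudaMemcpyHostToDevice));
    }
    // Pass 2: kernel launches return to the host immediately, so the
    // kernels on different GPUs are free to overlap.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        scaleKernel<<<(n + 255) / 256, 256>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
    }
    // Pass 3: device-to-host copies, again blocking and therefore serialized.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpy(h[k].data(), d[k], bytes, cudaMemcpyDeviceToHost));
        CUDA_CHECK(cudaFree(d[k]));
    }

    bool ok = true;
    for (int k = 0; k < numGPUs; k++) ok = ok && h[k][0] == 2.0 && h[k][n - 1] == 2.0;
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```

Because every kernel launch is issued before any blocking device-to-host copy, the host never waits between launches, which is consistent with the kernel overlap visible in the timeline.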

Test case #2 – “Depth-first” approach – synchronous copy

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test2.cu

Profiler timeline (image: MultiGPU_Test2)

This time, concurrency is achieved neither in the memory copies nor in the kernel executions.
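A depth-first loop with synchronous copies might look like the sketch below. It is an illustration under assumptions rather than the repository code: `scaleKernel`, `CUDA_CHECK`, the pageable host buffers and the sizes are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);

    std::vector<std::vector<double>> h(numGPUs, std::vector<double>(n, 1.0));
    std::vector<double *> d(numGPUs, nullptr);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
    }

    // Depth-first: copy in, compute and copy out for GPU k before touching GPU k+1.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpy(d[k], h[k].data(), bytes, cudaMemcpyHostToDevice));
        scaleKernel<<<(n + 255) / 256, 256>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
        // Blocking copy: the host waits here for the kernel and the transfer,
        // so the next GPU receives no work until this one has finished.
        CUDA_CHECK(cudaMemcpy(h[k].data(), d[k], bytes, cudaMemcpyDeviceToHost));
    }

    bool ok = true;
    for (int k = 0; k < numGPUs; k++) {
        ok = ok && h[k][0] == 2.0 && h[k][n - 1] == 2.0;
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaFree(d[k]));
    }
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```

The blocking device-to-host cudaMemcpy at the end of each iteration is what serializes the GPUs in this arrangement.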

Test case #3 – “Depth-first” approach – asynchronous copy with streams

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test3.cu

Profiler timeline (image: MultiGPU_Test3)

Concurrency is achieved, as expected.
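As a rough sketch of this case, each GPU gets its own stream and pinned host buffer, and all operations are issued with asynchronous calls. The names `scaleKernel` and `CUDA_CHECK`, the sizes, and the one-stream-per-GPU layout are assumptions for illustration, not necessarily what the linked code does.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);

    std::vector<double *> h(numGPUs, nullptr), d(numGPUs, nullptr);
    std::vector<cudaStream_t> stream(numGPUs);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMallocHost((void **)&h[k], bytes));  // pinned host memory
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
        CUDA_CHECK(cudaStreamCreate(&stream[k]));           // stream on device k
        for (int i = 0; i < n; i++) h[k][i] = 1.0;
    }

    // Depth-first issue order, but every call returns to the host immediately.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, stream[k]));
        scaleKernel<<<(n + 255) / 256, 256, 0, stream[k]>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, stream[k]));
    }

    // Wait for every GPU only after all the work has been queued.
    bool ok = true;
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaStreamSynchronize(stream[k]));
        ok = ok && h[k][0] == 2.0 && h[k][n - 1] == 2.0;
        CUDA_CHECK(cudaStreamDestroy(stream[k]));
        CUDA_CHECK(cudaFree(d[k]));
        CUDA_CHECK(cudaFreeHost(h[k]));
    }
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```

The host buffers are allocated with cudaMallocHost because cudaMemcpyAsync needs page-locked host memory to be asynchronous with respect to the host.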

Test case #4 – “Depth-first” approach – asynchronous copy within the default stream

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test4.cu

Profiler timeline (image: MultiGPU_Test4)

Despite using the default stream, concurrency is achieved: each device has its own default stream, so work queued on different GPUs is not serialized.
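The default-stream variant can be sketched by dropping the explicit streams and leaving the stream arguments at their default value. As before, `scaleKernel`, `CUDA_CHECK` and the sizes are hypothetical choices for this example and not taken from the repository.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);

    std::vector<double *> h(numGPUs, nullptr), d(numGPUs, nullptr);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMallocHost((void **)&h[k], bytes));  // pinned host memory
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
        for (int i = 0; i < n; i++) h[k][i] = 1.0;
    }

    // No stream argument: each call goes to the default stream of the
    // device selected by cudaSetDevice, and still returns immediately.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice));
        scaleKernel<<<(n + 255) / 256, 256>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost));
    }

    bool ok = true;
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaDeviceSynchronize());    // wait for device k to finish
        ok = ok && h[k][0] == 2.0 && h[k][n - 1] == 2.0;
        CUDA_CHECK(cudaFree(d[k]));
        CUDA_CHECK(cudaFreeHost(h[k]));
    }
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```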

Test case #5 – “Depth-first” approach – asynchronous copy within the default stream and a single host vector allocated with cudaMallocHost

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test5.cu

Profiler timeline (image: MultiGPU_Test5)

Concurrency is achieved once again.
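One way to picture the single host vector is a pinned buffer of `numGPUs * n` elements in which GPU `k` works on the slice starting at offset `k * n`. The slicing scheme, `scaleKernel`, `CUDA_CHECK` and the sizes are assumptions made for this sketch; the linked code may partition the data differently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);
    const size_t total = (size_t)numGPUs * n;   // elements in the host vector

    // One pinned host vector shared by all GPUs (this relies on unified
    // virtual addressing making the allocation usable from every device).
    double *h = nullptr;
    CUDA_CHECK(cudaMallocHost((void **)&h, total * sizeof(double)));
    for (size_t i = 0; i < total; i++) h[i] = 1.0;

    std::vector<double *> d(numGPUs, nullptr);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
    }

    for (int k = 0; k < numGPUs; k++) {
        double *slice = h + (size_t)k * n;      // portion handled by GPU k
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpyAsync(d[k], slice, bytes, cudaMemcpyHostToDevice));
        scaleKernel<<<(n + 255) / 256, 256>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(slice, d[k], bytes, cudaMemcpyDeviceToHost));
    }

    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaDeviceSynchronize());
        CUDA_CHECK(cudaFree(d[k]));
    }
    bool ok = h[0] == 2.0 && h[total - 1] == 2.0;
    CUDA_CHECK(cudaFreeHost(h));
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```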

Test case #6 – “Breadth-first” approach – asynchronous copy with streams

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test6.cu

Profiler timeline (image: MultiGPU_Test6)

Concurrency is achieved, as in the corresponding “depth-first” approach.
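For comparison, a breadth-first ordering of asynchronous calls could be sketched as three separate passes over the GPUs, each pass queuing one kind of operation in the stream of every device. The names `scaleKernel` and `CUDA_CHECK`, the sizes and the one-stream-per-GPU layout are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

// Hypothetical stand-in for the per-GPU workload: doubles every element.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    int numGPUs = 0;
    CUDA_CHECK(cudaGetDeviceCount(&numGPUs));
    const int n = 1 << 22;                      // elements per GPU (assumed)
    const size_t bytes = n * sizeof(double);

    std::vector<double *> h(numGPUs, nullptr), d(numGPUs, nullptr);
    std::vector<cudaStream_t> stream(numGPUs);
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMallocHost((void **)&h[k], bytes));  // pinned host memory
        CUDA_CHECK(cudaMalloc((void **)&d[k], bytes));
        CUDA_CHECK(cudaStreamCreate(&stream[k]));
        for (int i = 0; i < n; i++) h[k][i] = 1.0;
    }

    // Pass 1: queue every host-to-device copy.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, stream[k]));
    }
    // Pass 2: queue every kernel; stream order keeps it behind its own copy.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        scaleKernel<<<(n + 255) / 256, 256, 0, stream[k]>>>(d[k], n);
        CUDA_CHECK(cudaGetLastError());
    }
    // Pass 3: queue every device-to-host copy.
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, stream[k]));
    }

    bool ok = true;
    for (int k = 0; k < numGPUs; k++) {
        CUDA_CHECK(cudaSetDevice(k));
        CUDA_CHECK(cudaStreamSynchronize(stream[k]));
        ok = ok && h[k][0] == 2.0 && h[k][n - 1] == 2.0;
        CUDA_CHECK(cudaStreamDestroy(stream[k]));
        CUDA_CHECK(cudaFree(d[k]));
        CUDA_CHECK(cudaFreeHost(h[k]));
    }
    printf("%s\n", ok ? "results verified" : "unexpected results");
    return ok ? 0 : 1;
}
```

Since none of the calls blocks the host, the issue order matters much less here than it does with synchronous copies.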

Test case #7 – “Breadth-first” approach – asynchronous copy within the default stream

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test7.cu

Profiler timeline (image: MultiGPU_Test7)

Concurrency is achieved, as in the corresponding “depth-first” approach.

Test case #8 – “Breadth-first” approach – asynchronous copy within the default stream and a single host vector allocated with cudaMallocHost

Code – https://github.com/OrangeOwlSolutions/MultiGPU/blob/master/MultiGPU_Test8.cu

Profiler timeline (image: MultiGPU_Test8)

Concurrency is achieved, as in the corresponding “depth-first” approach.

Conclusion

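In these tests, asynchronous copies always led to concurrent execution across the GPUs, whether they were issued in purposely created streams or in the default stream. With synchronous copies, only the kernels overlapped, and only in the “breadth-first” approach.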

Note

In all the above examples, I have taken care to give the GPUs enough work to do, in terms of both copies and computing tasks. Failing to provide enough work to the cluster may prevent observing concurrent executions.
