Concurrency in CUDA multi-GPU executions

Concurrent execution on multi-GPU systems is appealing because, for embarrassingly parallel problems, it can reduce the execution time almost linearly with the number of GPUs. We ran some experiments on achieving concurrent execution on a cluster of 4 Kepler K20c GPUs. We considered 8 test cases; the code for each, along with the corresponding profiler timeline, is reported below.

Test case #1 - "Breadth-first" approach - synchronous copy

Code -