Let us consider the example from the last post, which showed how asynchronous copies enable true multi-GPU concurrency. In particular, let us consider Test case #8 of that post.
The full code of Test case #8 is available on our GitHub website, while the profiler timeline is reported here for the sake of clarity:
The full code for the timing example reported here is also available on our GitHub website.
Timing the asynchronous copies - concurrency is destroyed
Now...

# NVIDIA Visual Profiler

# Tricks and Tips: Using vote intrinsics to check for thread divergence

Below is a way to check for thread divergence within warps using vote intrinsics, in particular the __ballot and __popc intrinsics.
A good explanation of __ballot and __popc is available in the book by Shane Cook: CUDA Programming - Morgan Kaufmann.
The prototype of __ballot is the following:
unsigned int __ballot(int predicate);
If predicate is nonzero for a thread, the Nth bit of the value returned by __ballot is set, where N is that thread's lane index within its warp (threadIdx.x % 32 for a one-dimensional block).
On the other hand, __popc returns the number of bits set within a 3...
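As a minimal sketch of the idea, the kernel below lets each warp vote on a predicate and flags the warp as divergent when only some of its lanes voted true. The names `detectDivergence`, `countDivergentWarps`, `threshold` and `warpDiverged` are hypothetical and chosen for illustration only. The sketch also assumes a recent CUDA toolkit, where the synchronized variant `__ballot_sync(mask, predicate)` is used in place of the legacy `__ballot`, and a block size that is a multiple of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one flag per warp, 1 if the lanes disagree on the predicate.
__global__ void detectDivergence(const int *data, int threshold, int *warpDiverged, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;   // bit position of this thread in the ballot mask

    // Out-of-range threads vote 0; the short-circuit avoids reading past the array.
    int predicate = (tid < n) && (data[tid] > threshold);

    // Every thread of the warp takes part in the vote (no early return above).
    unsigned int mask  = __ballot_sync(0xffffffffu, predicate);
    int          votes = __popc(mask);   // number of lanes whose predicate was nonzero

    if (lane == 0)
        warpDiverged[tid / warpSize] = (votes != 0 && votes != warpSize);
}

// Hypothetical host helper: returns how many warps would diverge on "data > threshold".
int countDivergentWarps(const int *hostData, int n, int threshold)
{
    const int threadsPerBlock = 128;     // assumed to be a multiple of the warp size
    int blocks   = (n + threadsPerBlock - 1) / threadsPerBlock;
    int numWarps = blocks * threadsPerBlock / 32;

    int *dData = nullptr, *dFlags = nullptr;
    cudaMalloc(&dData,  n * sizeof(int));
    cudaMalloc(&dFlags, numWarps * sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);

    detectDivergence<<<blocks, threadsPerBlock>>>(dData, threshold, dFlags, n);

    std::vector<int> flags(numWarps);
    cudaMemcpy(flags.data(), dFlags, numWarps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dFlags);

    int count = 0;
    for (int f : flags) count += f;
    return count;
}

int main()
{
    // Elements 0..47 exceed the threshold, so only the warp covering 32..63 is split.
    std::vector<int> data(256, 0);
    for (int i = 0; i < 48; ++i) data[i] = 10;

    printf("divergent warps: %d\n", countDivergentWarps(data.data(), 256, 5));
    return 0;
}
```

Comparing the vote count against 0 and `warpSize` keeps the check to a single ballot per warp. A warp that straddles the end of the array is also reported as divergent here, since its out-of-range lanes vote 0.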

# Tricks and Tips: Profiling CUDA Matlab Mex file codes

You can profile Matlab mexfiles including CUDA codes using the NVIDIA Visual Profiler by the following procedure:
1. Write your mexfile including CUDA code (see this article for guidelines).
2. Add cudaDeviceReset() at the end of your mexfunction.
3. Write your Matlab .m file and add exit at its end.
4. Launch the NVIDIA Visual Profiler and go to File -> New Session.
5. In File: add the full path of the Matlab executable file, for example C:\Program Files\MATLAB\R2012b\bin\win64\MATLAB.exe .
...
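Steps 1 and 2 above can be sketched as follows. This is only an illustration under stated assumptions: the kernel `scaleKernel`, the helper `scaleOnDevice` and the doubling factor are hypothetical, and the `MATLAB_MEX_FILE` guard assumes that the mex build defines that macro, so that the same file can also be compiled standalone with nvcc.

```cuda
#include <cuda_runtime.h>
#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scaleKernel(double *x, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Hypothetical host helper: copies the input to the GPU, scales it, copies it back.
cudaError_t scaleOnDevice(const double *in, double *out, int n, double factor)
{
    double *d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(double));
    if (err != cudaSuccess) return err;

    cudaMemcpy(d, in, n * sizeof(double), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(d, factor, n);
    err = cudaMemcpy(out, d, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err;
}

#ifdef MATLAB_MEX_FILE
// Step 1: the mexfunction wrapping the CUDA code.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]))
        mexErrMsgTxt("Expected one real double array.");

    int n = (int)mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);

    if (scaleOnDevice(mxGetPr(prhs[0]), mxGetPr(plhs[0]), n, 2.0) != cudaSuccess)
        mexErrMsgTxt("CUDA call failed.");

    // Step 2: reset the device at the end so the profiler can collect its data.
    cudaDeviceReset();
}
#endif
```

For step 3, a matching Matlab script would call the compiled function (say, a hypothetical `scale_mex`) and then end with `exit`, so that the Matlab process terminates once the profiled call has run.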
