Tricks and Tips: Using vote intrinsics to check for thread divergence

Below is a way to check about thread divergence within warps using vote intrinsics and in particular the __ballot and __popc intrinsics.
A good explanation on __ballot and __popc is available in the book by Shane Cook: CUDA Programming – Morgan Kaufmann.
The prototype of __ballot is the following

unsigned int __ballot(int predicate);

If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is threadIdx.x.
On the other side, __popc returns the number of bits set withing a 32-bit parameter.
So, by jointly using __ballot__popc and atomicAdd, one can check if a warp is divergent or not.

The following code does check about thread divergence:

#include <cuda.h>
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

__device__ unsigned int __ballot_non_atom(int predicate)
{
   if (predicate != 0) return (1 << (threadIdx.x % 32));
   else return 0;
}

__global__ void gpu_test_divergency_0(unsigned int* d_ballot, int Num_Warps_per_Block)
{

   int tid = threadIdx.x + blockIdx.x * blockDim.x;
   const unsigned int warp_num = threadIdx.x >> 5;
   atomicAdd(&d_ballot[warp_num+blockIdx.x*Num_Warps_per_Block],__popc(__ballot_non_atom(tid > 2)));
   //atomicAdd(&d_ballot[warp_num+blockIdx.x*Num_Warps_per_Block],__popc(__ballot(tid > 2)));

}

#include <conio.h>

int main(int argc, char *argv[])
{

  unsigned int Num_Threads_per_Block      = 64;
  unsigned int Num_Blocks_per_Grid        = 1;
  unsigned int Num_Warps_per_Block        = Num_Threads_per_Block/32;
  unsigned int Num_Warps_per_Grid         = (Num_Threads_per_Block*Num_Blocks_per_Grid)/32;
  unsigned int* h_ballot = (unsigned int*)malloc(Num_Warps_per_Grid*sizeof(unsigned int));
  unsigned int* d_ballot; cudaMalloc((void**)&d_ballot, Num_Warps_per_Grid*sizeof(unsigned int));

  for (int i=0; i<Num_Warps_per_Grid; i++) h_ballot[i] = 0;

  cudaMemcpy(d_ballot, h_ballot, Num_Warps_per_Grid*sizeof(unsigned int), cudaMemcpyHostToDevice);

  gpu_test_divergency_0<<<Num_Blocks_per_Grid,Num_Threads_per_Block>>>(d_ballot,Num_Warps_per_Block);

  cudaMemcpy(h_ballot, d_ballot, Num_Warps_per_Grid*sizeof(unsigned int), cudaMemcpyDeviceToHost);

  for (int i=0; i<Num_Warps_per_Grid; i++) {

     if ((h_ballot[i] == 0)||(h_ballot[i] == 32)) std::cout << "Warp " << i << " IS NOT divergent- Predicate true for " << h_ballot[i] << " threadsn";

     else std::cout << "Warp " << i << " IS divergent - Predicate true for " << h_ballot[i] << " threadsn";

}

  getch();
  return EXIT_SUCCESS;

}

  
For compute capability < 2.0 cards, use the non-intrinsic device function __ballot_non_atom which is equivalent to __ballot, since __ballot is available only for compute capability >= 2.0.
In other words, if you have a card with compute capability >= 2.0, please uncommented the instruction using __ballot in the kernel function.

Finally, note that the vote intrinsics instructions in the kernel functions can be added to any kernel to evaluate thread divergence.

Leave a Reply

Your email address will not be published. Required fields are marked *