Implementing a critical section in CUDA

Critical sections are sequences of operations that must be executed sequentially by the CUDA threads.

Suppose to construct a kernel which has the task of computing the number of thread blocks of a thread grid. One possible idea is to let each thread in each block with threadIdx.x == 0 increase a global counter. To prevent race conditions, all the increases must occur sequentially, so they must be incorporated in a critical section.

This is illustrated in the code on our GitHub web page . Such a code has two kernel functions: blockCountingKernelNoLock and blockCountingKernelLock. The former does not use a critical section to increase the counter and, as you will see, returns wrong results. The latter encapsulates the counter increase within a critical section and so produces correct results. But how does the critical section work?

The critical section is governed by a global state d_state. Initially, the state is 0. Furthermore, two __device__ methods, lock and unlock, can change this state. The lock and unlock methods can be invoked only by a single thread within each block and, in particular, by the thread with local thread index threadIdx.x == 0.

Randomly during the execution, one of the threads with local thread index threadIdx.x == 0 and global thread index, say, t will be the first invoking the lock method. In particular, it will launch atomicCAS(d_state, 0, 1). Since initially d_state == 0, then d_state will be updated to 1, atomicCAS will return 0 and the thread will exit the lock function, passing to the update instruction. In the meanwhile such a thread performs the mentioned operations, all the other threads of all the other blocks with threadIdx.x == 0 will execute the lock method. They will however find a value of d_state equal to 1, so that atomicCAS(d_state, 0, 1) will perform no update and will return 1, so leaving these threads running the while loop. After that thread t accomplishes the update, then it executes the unlock function, namely atomicExch(d_state, 0), thus restoring d_state to 0. At this point, randomly, another of the threads with threadIdx.x == 0 will lock again the state.

The code on our GitHub web page contains also a third kernel function, namely blockCountingKernelDeadlock. However, this is another wrong implementation of the critical section, leading to deadlocks. Indeed, we recall that warps operate in lockstep and they synchronize after every instruction. So, when we execute blockCountingKernelDeadlock, there is the possibility that one of the threads in a warp, say a thread with local thread index t≠0, will lock the state. Under this circumstance, the other threads in the same warp of t, including that with  threadIdx.x == 0, will execute the same while loop statement as thread t, being the execution of threads in the same warp performed in lockstep. Accordingly, all the threads will wait for someone to unlock the state, but no other thread will be able to do so, and the code will be stuck in a deadlock.



Leave a Reply

Your email address will not be published. Required fields are marked *