Barnes-Hut N-body approach in CUDA: Construction of the bounding box

The relevant __global__ function is BoundingBoxKernel which is appointed to find the bounding box of the bodies. Essentially it performs reduction operations along the bodies position coordinates.

To illustrate how it works, let us consider the following simple example. Suppose to launch 4 threads. In this case, inc = 4.

Each thread initializes the register variables minx, maxx, miny, maxy, minz, maxz as The first for loop degenerates into a single cycle with j=i, i being the thread index.
In this cycle, each thread evaluates the maxima and minima between their original temporary maxima and minima (minx, maxx, miny, maxy, minz, maxz) and the array values corresponding to their thread index (posx[i], posy[i], posz[i]).
After that, the values of the register variables are moved to shared memory so that a subsequent reduction in shared memory can be performed. At this point, we have six shared memory arrays filled as in the previous table.

In the subsequent for loop, j assumes the values of 2 and 1. So, due to the if statement, only the threads with indices 0 and 1 (so half of the active threads) will operate.

Now consider first the threads i = 0. It performs two cycles, one with j=2 and one with j=1 and modifies only the values with 0 index of the shared memory arrays and its own register variables.
For j=2, it reduces the register variables and the shared memory elements with k=2, so it leads to For j=1, it reduces the register variables and the shared memory elements with k=1, so it leads to The thread with i=1 performs only one cycle with j=2 and it reduces the register variables and the shared memory elements with k=3, so it leads to Finally, the last part is executed by only the thread with relative (to the block) thread index i=0. In the considered example, there is only one block so that such a thread reduces the yellow and red columns in the above table.

The radius is defined as the maximum radius along the x-, y- and z-axes. The mass of the root node is initialized to -1, the position of the root node is initialized as the center of the bounding box and the 8 possible children are initialized to -1 