The Finite-Difference Time-Domain (FDTD) method is a popular and rather intuitive method for solving partial differential equations that describe time-evolving phenomena.

The relevant partial differential equations are discretized in both time and space by finite differences.

“Update equations” expressing the (unknown) future fields in terms of (known) past fields are established.

We focus here on the electromagnetic case, in which Maxwell's equations are solved, although FDTD is also widely used in other fields, for example in fluid dynamics to solve the Navier-Stokes equations.

We skip the details of the theory of FDTD as applied to Maxwell's equations; many books have been published on the subject.

If you would like free access to an introductory book on this topic, we recommend *Understanding the Finite-Difference Time-Domain Method* by J.B. Schneider, available for download from the author's EECS web site.

Here we just highlight the hot spots that need to be addressed when implementing a CUDA version of the method. For simplicity, we consider the 2D case with a *z*-directed electric field excitation.
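For reference, with a *z*-directed electric field (TMz polarization) in a lossless medium, Maxwell's curl equations reduce to three coupled scalar equations (standard form; μ and ε denote permeability and permittivity):

```latex
\frac{\partial H_x}{\partial t} = -\frac{1}{\mu}\,\frac{\partial E_z}{\partial y}, \qquad
\frac{\partial H_y}{\partial t} = \frac{1}{\mu}\,\frac{\partial E_z}{\partial x}, \qquad
\frac{\partial E_z}{\partial t} = \frac{1}{\varepsilon}\left(\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}\right).
```

In the code below, Ez is additionally split into the two components Ezx and Ezy (with Ez = Ezx + Ezy), as is customary for Berenger's split-field PML formulation.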

The update equations for the *x* and *y* components of the magnetic field, Hx and Hy respectively, are

```
// --- Update of Hy and Hx
for (int i = n1; i <= n2; i++)
    for (int j = n11; j <= n21; j++) {
        Hy[i*ydim+j] = A[i*ydim+j]*Hy[i*ydim+j] +
                       B[i*ydim+j]*(Ezx[(i+1)*ydim+j] - Ezx[i*ydim+j] +
                                    Ezy[(i+1)*ydim+j] - Ezy[i*ydim+j]);
        Hx[i*ydim+j] = G[i*ydim+j]*Hx[i*ydim+j] -
                       H[i*ydim+j]*(Ezx[i*ydim+j+1] - Ezx[i*ydim+j] +
                                    Ezy[i*ydim+j+1] - Ezy[i*ydim+j]);
    }
```

Here xdim and ydim are the dimensions of the computational domain, the matrices A, B, G and H are needed for the implementation of the Perfectly Matched Layer (PML) absorbing boundary conditions, and n1, n2, n11 and n21 delimit the computational region reached by the wavefront. We have to translate this loop into a CUDA kernel. A first, straightforward implementation is the following:

```
/* --- Evaluate the thread indices */
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
    Hy_h[idx*ydim+idy] = A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy] +
                         B_h[idx*ydim+idy]*(Ezx_h[(idx+1)*ydim+idy] - Ezx_h[idx*ydim+idy] +
                                            Ezy_h[(idx+1)*ydim+idy] - Ezy_h[idx*ydim+idy]);
    Hx_h[idx*ydim+idy] = G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy] -
                         H_h[idx*ydim+idy]*(Ezx_h[idx*ydim+idy+1] - Ezx_h[idx*ydim+idy] +
                                            Ezy_h[idx*ydim+idy+1] - Ezy_h[idx*ydim+idy]);
}
```

This solution is not fully satisfactory, at least because most elements of Ezx_h and Ezy_h are read twice from global memory, once by each of two neighboring threads.

It would be better to stage these variables in shared memory, which is much more convenient when repeated accesses are required. However, when using shared memory we must take into account that neighboring thread blocks are coupled through their border elements: each block also needs data owned by its neighbors.

In other words, we need a convenient way to exploit thread parallelism so that a BS_X * BS_Y thread block can load a (BS_X+1) * (BS_Y+1) global memory region, where the one extra row and column hold the neighboring elements just mentioned.

A possible solution is the following:

```
/* --- Evaluate the thread indices */
int i   = threadIdx.x;
int j   = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

/* --- Remap the BLOCK_SIZE_X x BLOCK_SIZE_Y threads onto the
       (BLOCK_SIZE_X+1) x (BLOCK_SIZE_Y+1) shared memory tile
       (square blocks, BLOCK_SIZE_X == BLOCK_SIZE_Y, are assumed) */
int index1 = j*BLOCK_SIZE_X + i;
int i1 = index1 % (BLOCK_SIZE_X+1);
int j1 = index1 / (BLOCK_SIZE_X+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1) % (BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1) / (BLOCK_SIZE_X+1);

__shared__ double Hx_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
__shared__ double Hy_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];

/* --- First load: tile entries 0 .. BLOCK_SIZE_X*BLOCK_SIZE_Y-1 */
if (((blockIdx.x*BLOCK_SIZE_X+i1) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y+j1) < ydim)) {
    Hx_h_shared[i1][j1] = Hx_h[(blockIdx.x*BLOCK_SIZE_X+i1)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j1)];
    Hy_h_shared[i1][j1] = Hy_h[(blockIdx.x*BLOCK_SIZE_X+i1)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j1)];
}

/* --- Second load: the remaining tile entries */
if (((i2 < (BLOCK_SIZE_X+1)) && (j2 < (BLOCK_SIZE_Y+1))) &&
    (((blockIdx.x*BLOCK_SIZE_X+i2) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y+j2) < ydim))) {
    Hx_h_shared[i2][j2] = Hx_h[(blockIdx.x*BLOCK_SIZE_X+i2)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j2)];
    Hy_h_shared[i2][j2] = Hy_h[(blockIdx.x*BLOCK_SIZE_X+i2)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j2)];
}

__syncthreads();

/* --- Update of Ezx and Ezy, reading H from the shared memory tile */
if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
    Ezx_h[(idx+1)*ydim+idy+1] = C_h[(idx+1)*ydim+idy+1]*Ezx_h[(idx+1)*ydim+idy+1] +
                                D_h[(idx+1)*ydim+idy+1]*(-Hx_h_shared[i+1][j+1] + Hx_h_shared[i+1][j]);
    Ezy_h[(idx+1)*ydim+idy+1] = E_h[(idx+1)*ydim+idy+1]*Ezy_h[(idx+1)*ydim+idy+1] +
                                F_h[(idx+1)*ydim+idy+1]*(Hy_h_shared[i+1][j+1] - Hy_h_shared[i][j+1]);
}
```

The (i1,j1) and (i2,j2) trick enables the BS_X * BS_Y threads of a block to cooperatively load the required (BS_X+1) * (BS_Y+1) global memory region in two passes.

The barrier synchronization __syncthreads() is required to ensure that all the shared memory loads have completed before any thread of the block starts computing; note that no comparable synchronization mechanism is available at the grid level.

Of the two solutions, the second is faster, except at the smallest size, where the extra index arithmetic and shared memory traffic are not paid back. The kernel execution times for the two cases are reported in the following table:

| Size | First solution | Second solution |
|-----------|---------|---------|
| 512×512   | 4.2 ms  | 9.8 ms  |
| 1024×1024 | 20.9 ms | 14.5 ms |
| 2048×2048 | 82.8 ms | 57.4 ms |

Other useful information for comparing the two solutions comes from the Visual Profiler and, in particular, from the Instructions Per Clock (IPC) counter:

| IPC | First solution | Second solution |
|---------|-------|-------|
| Maximum | 1.208 | 1.647 |
| Average | 0.629 | 1.501 |
| Minimum | 0.309 | 1.404 |

If you want to know more, try it yourself! We provide a free FDTD simulator.