Bilinear interpolation in CUDA: an interview

Q: Bilinear interpolation is an extension of the well-known linear interpolation scheme to functions of two variables. Could you please walk through the math first, so that the CPU and GPU implementations will be clearer?

A: Sure. Assume that the original function T(x, y) is sampled on a regular Cartesian grid of points (i, j), with 0 <= i < M1, 0 <= j < M2 and i and j integers.
For each row j, you can first use 0 <= a < 1 to represent an arbitrary point i + a lying between i and i + 1.
A linear interpolation along the line y = j (which is parallel to the x axis) at that point can then be performed, obtaining:

$$ r(i + a, j) = (1 - a)\,T(i, j) + a\,T(i + 1, j) \quad\quad (1) $$

where r(x, y) is the function interpolating the samples of T(x, y). Do the same for the line y = j + 1, obtaining:

$$ r(i + a, j + 1) = (1 - a)\,T(i, j + 1) + a\,T(i + 1, j + 1) \quad\quad (2) $$

Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i+a, j) and r(i+a, j+1).

Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, a linear interpolation along the line x = i + a (which is parallel to the y axis) can be worked out, yielding the final result:

$$ r(i + a, j + b) = (1 - b)\,\bigl[(1 - a)\,T(i, j) + a\,T(i + 1, j)\bigr] + b\,\bigl[(1 - a)\,T(i, j + 1) + a\,T(i + 1, j + 1)\bigr] \quad\quad (3) $$

Note that the relations between i, j, a, b, x and y are the following:

$$ i = \lfloor x \rfloor, \quad a = x - i, \quad j = \lfloor y \rfloor, \quad b = y - j \quad\quad (4) $$
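
As a quick worked example (the numbers are arbitrary), take the interpolation point (x, y) = (1.25, 2.5). From (4), i = 1, a = 0.25, j = 2 and b = 0.5, so (3) reduces to

$$ r(1.25, 2.5) = 0.375\,T(1, 2) + 0.125\,T(2, 2) + 0.375\,T(1, 3) + 0.125\,T(2, 3), $$

with the four weights summing to 1, as expected.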

Q: Quite clear. Before going to CUDA, could you please first provide the sequential C/C++ implementation, to get acquainted with the problem?

A: Yes. Remember that this implementation, as well as the following CUDA ones, assumes, as I mentioned at the beginning, that the samples of T are located on a regular Cartesian grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers (unit spacing).
Also, the routine reported below is provided in single-precision, complex (float2) arithmetic, but it can easily be recast into other arithmetics of interest.

void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, const float2 * __restrict__ h_data,
                                         const float * __restrict__ h_xout, const float * __restrict__ h_yout,
                                         const int M1, const int M2, const int N1, const int N2){

    float2 result_temp1, result_temp2;
    for(int k=0; k<N2; k++){
        for(int l=0; l<N1; l++){

            // --- i = floor(x) and a = x - i, see equation (4)
            const int   ind_x = floorf(h_xout[k*N1+l]);
            const float a     = h_xout[k*N1+l]-ind_x;

            // --- j = floor(y) and b = y - j, see equation (4)
            const int   ind_y = floorf(h_yout[k*N1+l]);
            const float b     = h_yout[k*N1+l]-ind_y;

            // --- Fetch the four neighboring samples, zero-padding outside [0, M1-1] x [0, M2-1]
            float2 h00, h01, h10, h11;
            if ((ind_x   >= 0)&&(ind_x   < M1)&&(ind_y   >= 0)&&(ind_y   < M2)) h00 = h_data[ind_y*M1+ind_x];       else h00 = make_float2(0.f, 0.f);
            if ((ind_x+1 >= 0)&&(ind_x+1 < M1)&&(ind_y   >= 0)&&(ind_y   < M2)) h10 = h_data[ind_y*M1+ind_x+1];     else h10 = make_float2(0.f, 0.f);
            if ((ind_x   >= 0)&&(ind_x   < M1)&&(ind_y+1 >= 0)&&(ind_y+1 < M2)) h01 = h_data[(ind_y+1)*M1+ind_x];   else h01 = make_float2(0.f, 0.f);
            if ((ind_x+1 >= 0)&&(ind_x+1 < M1)&&(ind_y+1 >= 0)&&(ind_y+1 < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);

            // --- Interpolation along x, equations (1) and (2); (1-a)*h00 + a*h10 rewritten to map onto fused multiply-adds
            result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
            result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

            result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
            result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

            // --- Interpolation along y, equation (3)
            h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
            h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

        }
    }
}
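
As a minimal usage sketch (the sizes, the sample values and the output coordinates below are arbitrary assumptions, just for illustration; compile as a .cu file so that float2 and make_float2 are available), the routine can be driven as follows:

#include <cuda_runtime.h>   // --- float2 and make_float2 in host code
#include <cstdlib>

int main(){

    // --- Hypothetical sizes: M1 x M2 input grid, N1 x N2 output points
    const int M1 = 16, M2 = 16, N1 = 8, N2 = 8;

    float2 *h_data   = (float2 *)malloc(M1 * M2 * sizeof(float2));
    float2 *h_result = (float2 *)malloc(N1 * N2 * sizeof(float2));
    float  *h_xout   = (float  *)malloc(N1 * N2 * sizeof(float));
    float  *h_yout   = (float  *)malloc(N1 * N2 * sizeof(float));

    // --- Fill the input samples and the output coordinates (arbitrary values)
    for (int j = 0; j < M2; j++) for (int i = 0; i < M1; i++)
        h_data[j * M1 + i] = make_float2((float)(i + j), (float)(i - j));
    for (int k = 0; k < N2; k++) for (int l = 0; l < N1; l++) {
        h_xout[k * N1 + l] = 0.5f + l;
        h_yout[k * N1 + l] = 0.5f + k;
    }

    bilinear_interpolation_function_CPU(h_result, h_data, h_xout, h_yout, M1, M2, N1, N2);

    free(h_data); free(h_result); free(h_xout); free(h_yout);
    return 0;
}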

Q: What are those if/else statements for? Are they really needed?

A: Yes. They are simply boundary checks: if a needed sample falls outside the [0, M1-1] x [0, M2-1] region, it is replaced by zero. For instance, for an interpolation point with x = M1 - 0.5, the right neighbors T(M1, j) and T(M1, j+1) do not exist and are set to zero.

Q: Got it. Now I think I’m ready for the CUDA implementations. Go ahead, please.

A: This is a “standard” CUDA implementation mirroring the CPU one above. No usage of texture memory. Very simple.

__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data,
                                                  const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                  const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   // --- One thread per output point; excess threads do nothing
   if ((l<N1)&&(k<N2)) {

       float2 result_temp1, result_temp2;

       // --- i = floor(x), a = x - i, j = floor(y), b = y - j, see equation (4)
       const int   ind_x = floorf(d_xout[k*N1+l]);
       const float a     = d_xout[k*N1+l]-ind_x;

       const int   ind_y = floorf(d_yout[k*N1+l]);
       const float b     = d_yout[k*N1+l]-ind_y;

       // --- Fetch the four neighboring samples, zero-padding outside [0, M1-1] x [0, M2-1]
       float2 d00, d01, d10, d11;
       if ((ind_x   >= 0)&&(ind_x   < M1)&&(ind_y   >= 0)&&(ind_y   < M2)) d00 = d_data[ind_y*M1+ind_x];       else d00 = make_float2(0.f, 0.f);
       if ((ind_x+1 >= 0)&&(ind_x+1 < M1)&&(ind_y   >= 0)&&(ind_y   < M2)) d10 = d_data[ind_y*M1+ind_x+1];     else d10 = make_float2(0.f, 0.f);
       if ((ind_x   >= 0)&&(ind_x   < M1)&&(ind_y+1 >= 0)&&(ind_y+1 < M2)) d01 = d_data[(ind_y+1)*M1+ind_x];   else d01 = make_float2(0.f, 0.f);
       if ((ind_x+1 >= 0)&&(ind_x+1 < M1)&&(ind_y+1 >= 0)&&(ind_y+1 < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);

       // --- Interpolation along x, equations (1) and (2)
       result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
       result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

       result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
       result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

       // --- Interpolation along y, equation (3)
       d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
       d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

   }
}
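
A possible launch configuration (the 16 x 16 block size is an assumption; any 2D block covering the N1 x N2 output points works) is:

// --- One thread per output point; d_result, d_data, d_xout and d_yout are
// --- device arrays holding the same data as their h_ counterparts above
dim3 block(16, 16);
dim3 grid((N1 + block.x - 1) / block.x, (N2 + block.y - 1) / block.y);
bilinear_interpolation_kernel_GPU<<<grid, block>>>(d_result, d_data, d_xout, d_yout, M1, M2, N1, N2);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());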

Q: In this implementation, the samples reside in global memory. But I heard that global memory can also be fetched through the texture cache, so as to exploit data locality and gain some benefits. Is that true, and how can it be done?

A: You are right. You can fetch global memory through the texture cache and obtain some benefits, especially on earlier architectures. Here is the code:

// --- Texture reference with file scope, bound by TextureBindingBilinearFetch() below
texture<float2, 2, cudaReadModeElementType> d_texture_fetch_float;

__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result,
                                                                const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   if ((l<N1)&&(k<N2)) {

       float2 result_temp1, result_temp2;

       const int   ind_x = floorf(d_xout[k*N1+l]);
       const float a     = d_xout[k*N1+l]-ind_x;

       const int   ind_y = floorf(d_yout[k*N1+l]);
       const float b     = d_yout[k*N1+l]-ind_y;

       // --- Fetch the four neighboring samples through the texture cache
       const float2 d00 = tex2D(d_texture_fetch_float, ind_x,   ind_y);
       const float2 d10 = tex2D(d_texture_fetch_float, ind_x+1, ind_y);
       const float2 d01 = tex2D(d_texture_fetch_float, ind_x,   ind_y+1);
       const float2 d11 = tex2D(d_texture_fetch_float, ind_x+1, ind_y+1);

       // --- Interpolation along x, equations (1) and (2)
       result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
       result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

       result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
       result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

       // --- Interpolation along y, equation (3)
       d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
       d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

   }
}

Q: Fine. You are basically replacing d_data[ind_y*M1+ind_x] with tex2D(d_texture_fetch_float, ind_x, ind_y), where d_texture_fetch_float is the texture reference with file scope, right?

A: Right. This is the same implementation as above, but global memory is now accessed through the texture cache, as you noticed. T[i, j] is now accessed as:

    tex2D(d_texture_fetch_float,ind_x,ind_y);

where, of course ind_x = i and ind_y = j.

Q: But did you forget the boundary checks?

A: No. No if/else boundary checks are needed, because the texture unit automatically returns zero for fetches falling outside the [0, M1-1] x [0, M2-1] sampling region if you bind the texture with the border addressing mode (note that cudaAddressModeClamp would instead replicate the edge samples), as follows:

void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float2* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    // --- Out-of-range fetches return zero; set the attributes before binding
    d_texture_fetch_float.addressMode[0] = cudaAddressModeBorder;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeBorder;
    gpuErrchk(cudaBindTexture2D(0, &d_texture_fetch_float, data_d, &desc, M1, M2, pitch));
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}

In particular, the two instructions:

    d_texture_fetch_float.addressMode[0] = cudaAddressModeBorder;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeBorder;

are doing the job.
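
By the way, the gpuErrchk macro used here is not part of the CUDA toolkit; it is a common error-checking helper whose usual definition (a sketch you may adapt) is:

#include <cstdio>
#include <cstdlib>

// --- Common CUDA error-checking helper (not part of the CUDA toolkit itself)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}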

Q: Are you also exploiting the filtering capabilities of the texture? I heard it can automagically perform bilinear interpolation directly in hardware, which will be very fast, but at the price of some loss of precision.

A: No. The hard-wired texture filtering capabilities are not exploited here. The texture-fetch routine above has the same precision as the plain global-memory one and, as I already mentioned, could turn out somewhat faster on older CUDA architectures.

Q: Ok, but finally, could you be so kind as to also provide an example of how to exploit the mentioned texture filtering features?

A: Sure. Here is the code, very short and simple:

// --- Texture reference with file scope; linear filtering is enabled by TextureBindingBilinearInterp() below
texture<float2, 2, cudaReadModeElementType> d_texture_interp_float;

__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result,
                                                                 const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                 const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   // --- The texture unit performs the whole bilinear interpolation in hardware
   if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}

Remember in this case to bind the texture as follows:

void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float2* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    // --- Set the attributes before binding
    d_texture_interp_float.addressMode[0] = cudaAddressModeBorder;  // --- Out-of-range fetches return zero
    d_texture_interp_float.addressMode[1] = cudaAddressModeBorder;
    d_texture_interp_float.filterMode = cudaFilterModeLinear;       // --- Enable linear filtering
    d_texture_interp_float.normalized = false;                      // --- Texture coordinates will NOT be normalized
    gpuErrchk(cudaBindTexture2D(0, &d_texture_interp_float, data_d, &desc, M1, M2, pitch));
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}

to enable texture filtering.
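
Note, as a side remark, that texture references like d_texture_interp_float are deprecated in recent CUDA releases; a sketch of the equivalent setup with the texture object API (variable names are mine, error checks omitted for brevity) is:

// --- Sketch: texture-object equivalent of TextureBindingBilinearInterp;
// --- data_d and pitch come from cudaMallocPitch as above.
cudaResourceDesc resDesc = {};
resDesc.resType                  = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr       = data_d;
resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float2>();
resDesc.res.pitch2D.width        = M1;
resDesc.res.pitch2D.height       = M2;
resDesc.res.pitch2D.pitchInBytes = pitch;

cudaTextureDesc texDesc = {};
texDesc.addressMode[0]   = cudaAddressModeBorder;   // --- Zero outside the sampling region
texDesc.addressMode[1]   = cudaAddressModeBorder;
texDesc.filterMode       = cudaFilterModeLinear;    // --- Hardware bilinear filtering
texDesc.readMode         = cudaReadModeElementType;
texDesc.normalizedCoords = 0;

cudaTextureObject_t texObj;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

The kernel then takes texObj as an argument and fetches with tex2D<float2>(texObj, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f).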

Q: Why are you offsetting the interpolation coordinates by 0.5?

A: Because the interpolation formula implemented by texture filtering is the same as the one derived above, but the fractional parts are now computed from coordinates shifted by half a texel:

$$ \mathrm{tex2D}(x, y) = (1 - a)(1 - b)\,T(i, j) + a(1 - b)\,T(i + 1, j) + (1 - a)\,b\,T(i, j + 1) + a\,b\,T(i + 1, j + 1) \quad\quad (5) $$

where now $i = \lfloor x_B \rfloor$, $a = x_B - i$, $j = \lfloor y_B \rfloor$ and $b = y_B - j$, with $x_B = x - 0.5$ and $y_B = y - 0.5$.
Fetching at x + 0.5 and y + 0.5 makes $x_B = x$ and $y_B = y$, so that i, j, a and b coincide with those of equation (4). This explains the 0.5 offset in the instruction:

    tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)

Q: Ok, understood. But where/why does the loss of accuracy occur?

A: As mentioned in the CUDA Programming Guide, a and b are stored in 9-bit fixed-point format with 8 bits of fractional value, so the interpolation weights are quantized to multiples of 1/256.
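
If you want to estimate the impact of this quantization against the full-precision result of equation (3), a minimal sketch is the following (round-to-nearest quantization of the fractional parts is my assumption here; the Guide only specifies the storage format):

// --- Hypothetical helper: emulate the 8-bit weight quantization of texture
// --- filtering; compare a software lerp using a_q = quantize_weight_8bit(a)
// --- and b_q = quantize_weight_8bit(b) against one using the exact a and b.
__host__ __device__ inline float quantize_weight_8bit(const float a) {
    return roundf(a * 256.0f) / 256.0f;    // --- Weights move in steps of 1/256
}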

Q: Everything is clear now. I’m now going to integrate your routines in my own codes 🙂

2 thoughts on “Bilinear interpolation in CUDA: an interview”

  1. Bilgin says:

So what are d_xout and d_yout?

    1. OrangeOwl says:

Hi Bilgin, they are the x and y coordinates of the output interpolation points on the GPU.
