On our GitHub website we are posting a fully worked code concerning the optimization of the solution approach for the 2D heat equation.
Five approaches are considered, using:
Global memory, essentially the OP's approach;
Shared memory of size BLOCK_SIZE_X x BLOCK_SIZE_Y not loading the halo regions;
Shared memory of size BLOCK_SIZE_X x BLOCK_SIZE_Y loading the halo regions;
Shared memory of size (BLOCK_SIZE_X + 2) x (BLOCK_SIZE_Y + 2) loading the halo regions;
Texture memory.
E...

More
# Mathematics

# Solution of linear systems of equations in CUDA

There are essentially three approaches to solve linear systems of equations in CUDA.
Approach nr. 1
As of February 2015, CUDA 7.0 (now in Release Candidate) offers the new cuSOLVER library including the possibility of calculating the QR decomposition of a matrix.
This, in conjunction with the cuBLAS library, enables to solve a linear system according to the guidelines expounded in Appendix C of the cuSOLVER User's Guide.
The steps to follow are three:
1) geqrf: it calculates the QR decomposit...

More
# Implementations of the BiConjugate Gradient Stabilized optimizer on CPU and GPU

The BiConjugate Gradient Stabilized (BiCGStab) optimizer, see http://en.wikipedia.org/wiki/Biconjugate_gradient_stabilized_method, can be easily and efficiently implemented on both CPU and GPU by making a massive use of BLAS/cuBLAS since the code is based on calculating matrix-vector multiplications, scalar products and norms.
For the GPU side, it is worth having one's own implementation of the BiCGStab optimizer since the sample contained in the CUDA SDK refers to sparse linear systems, while ...

More
# Gaussian elimination with CUDA

The VS project Gaussian elimination with CUDA, that you may find in download section, contains CPU and GPU routines for solving a linear system of equations by Gaussian elimination without pivoting.
Besides providing standard CPU Gaussian elimination and solution of an upper triangular system, sequential and parallel codes have been developed based on the paper:
Manuel Carcenac, "From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resol...

More
# Lagrange polynomials for interpolation

Linear interpolation consists of approximating a function [latex size="1"]f(x)[/latex] as:
[latex size="1"]f(x)=sum_{i=1}^{N}a_i phi_i(x);;;;;;;;;;;(1)[/latex]
where the [latex size="1"]a_i[/latex]'s are the interpolation coefficients and the [latex size="1"]phi_i[/latex]'s are prefixed interpolation functions.
Lagrange interpolation, which is one of the simplest and mostly employed interpolation methods, consists of finding the interpolation coefficients as the solution of the linear system
[...

More
# Finite Difference Time Domain Method

The Finite Difference Time Domain (FDTD) method is a very popular and rather intuitive method to solve partial differential equations describing time-evolving phenomena.
The relevant partial differential equations are discretized in both time and space by finite differences.
“Update equations” expressing the (unknown) future fields in terms of (known) past fields are established.
We here focus on the electromagnetic case, in which Maxwell's equations should be solved, although FDTD is very ...

More