The Visual Studio project "Gaussian elimination with CUDA", available in the download section, contains CPU and GPU routines for solving a linear system of equations by Gaussian elimination without pivoting.
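As a point of reference, Gaussian elimination without pivoting can be sketched on the CPU as follows. This is a minimal illustration and not the project's own code: the function name `eliminateNoPivot` and the row-major augmented layout `[A | b]` are assumptions made for the example, and every pivot is assumed to be nonzero because no rows are exchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the project): reduces the augmented
// matrix [A | b], stored row-major with n rows and n + 1 columns, to upper
// triangular form. No pivoting is done, so each pivot a[k][k] is assumed
// to be nonzero.
void eliminateNoPivot(std::vector<double>& a, std::size_t n) {
    const std::size_t ld = n + 1;  // row length of the augmented matrix
    assert(a.size() == n * ld);
    for (std::size_t k = 0; k + 1 < n; ++k) {
        const double pivot = a[k * ld + k];
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = a[i * ld + k] / pivot;  // multiplier for row i
            a[i * ld + k] = 0.0;                     // entry eliminated by construction
            for (std::size_t j = k + 1; j < ld; ++j)
                a[i * ld + j] -= m * a[k * ld + j];  // row_i -= m * row_k
        }
    }
}
```

Treating the right-hand side as column `n` of the matrix lets the same row operations update `b` without a separate loop.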
In addition to standard CPU routines for Gaussian elimination and for the solution of an upper triangular system, sequential and parallel codes have been developed based on the paper:
Manuel Carcenac, “From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices”, Journal of Supercomputing, DOI 10.1007/s11227-013-1043-3
and on the presentation:
“Application: linear system resolution with Gauss method”, available at the author’s webpage.
Five GPU approaches have been implemented:
- 1) “Tiling” with kernels not using shared memory;
- 2) “Tiling” with kernels using shared memory;
- 3) “Brute-force” cuBLAS based approach;
- 4) “Tiling” approach using cuBLAS;
- 5) “Tiling & striping” approach using cuBLAS.
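The idea behind tiling can be illustrated with a sequential sketch. The code below is an assumption-laden example and is not taken from the project or from the cited paper: the name `eliminateTiled`, the parameter `tile` (tile edge length) and the three-step organization are chosen only for illustration. The matrix is processed one block column at a time, and the bulk of the work ends up in step 3, a tile-by-tile matrix product, which is the kind of operation a cuBLAS-based variant would presumably hand to a GEMM routine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the project's code): tiled Gaussian elimination
// without pivoting on the augmented matrix [A | b], stored row-major with
// n rows and n + 1 columns. All pivots are assumed to be nonzero.
void eliminateTiled(std::vector<double>& a, std::size_t n, std::size_t tile) {
    const std::size_t ld = n + 1;
    assert(tile > 0 && a.size() == n * ld);
    for (std::size_t k0 = 0; k0 < n; k0 += tile) {
        const std::size_t k1 = std::min(k0 + tile, n);

        // Step 1: eliminate inside the column panel [k0, k1) and keep each
        // multiplier in place of the entry it eliminates.
        for (std::size_t k = k0; k < k1; ++k)
            for (std::size_t i = k + 1; i < n; ++i) {
                const double m = a[i * ld + k] / a[k * ld + k];
                a[i * ld + k] = m;
                for (std::size_t j = k + 1; j < k1; ++j)
                    a[i * ld + j] -= m * a[k * ld + j];
            }

        // Step 2: replay the same row operations on the part of rows
        // [k0, k1) that lies to the right of the diagonal tile.
        for (std::size_t k = k0; k < k1; ++k)
            for (std::size_t i = k + 1; i < k1; ++i)
                for (std::size_t j = k1; j < ld; ++j)
                    a[i * ld + j] -= a[i * ld + k] * a[k * ld + j];

        // Step 3: update the trailing tiles one at a time; each update is
        // a small matrix-matrix product (multipliers times pivot rows).
        for (std::size_t i0 = k1; i0 < n; i0 += tile)
            for (std::size_t j0 = k1; j0 < ld; j0 += tile) {
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t j1 = std::min(j0 + tile, ld);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k)
                        for (std::size_t j = j0; j < j1; ++j)
                            a[i * ld + j] -= a[i * ld + k] * a[k * ld + j];
            }
    }

    // Clear the stored multipliers so that only the upper triangle remains.
    for (std::size_t i = 1; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            a[i * ld + j] = 0.0;
}
```

With `tile` equal to 1 the routine degenerates to plain row-by-row elimination. Larger tiles do the same arithmetic but group it into blocks, which is what makes shared-memory kernels or library matrix products applicable.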
The performance of the five approaches has been tested on a Kepler K20c GPU.
The following table summarizes the results:
| Matrix size | CPU | Kernel no shared | Kernel shared | cuBLAS v1 | cuBLAS v2 | cuBLAS v3 |
The timings above refer only to the transformation of the system matrix into an upper triangular matrix.
This test is not meant to show how throughput scales with the matrix size, but only how the timing decreases progressively as the solution strategy is improved.
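Since the timings cover only the reduction to upper triangular form, the remaining step is the solution of the upper triangular system. A minimal sketch of that back substitution is given below. It assumes the same row-major augmented layout `[U | c]` with n rows and n + 1 columns, and the name `backSubstitute` is hypothetical rather than taken from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the project): solves U x = c, where
// the augmented matrix [U | c] is stored row-major with n rows and n + 1
// columns and U is upper triangular with nonzero diagonal entries.
std::vector<double> backSubstitute(const std::vector<double>& a, std::size_t n) {
    const std::size_t ld = n + 1;
    assert(a.size() == n * ld);
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {       // rows n-1 down to 0
        double s = a[i * ld + n];             // right-hand side entry c[i]
        for (std::size_t j = i + 1; j < n; ++j)
            s -= a[i * ld + j] * x[j];        // subtract the already known terms
        x[i] = s / a[i * ld + i];
    }
    return x;
}
```

This step costs on the order of n² operations, against n³ for the elimination, which is consistent with the timings focusing on the elimination stage.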