The calculation of the Singular Value Decomposition (SVD) of a matrix is at the basis of many computations and approaches in applied science. One example is the regularized solution of linear systems of equations. Another is Principal Component Analysis.

Many times, the applications requiring the SVD calculation deal with large matrices and/or request the SVD computation in an iterative process.

Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVER library. Below, we provide a representative example relevant in common situations.

Before proceeding further, two points to remember:

- gesvd routines assume Nrows >= Ncols. Of course, the case Nrows < Ncols can be dealt with by matrix transposition, for example by cublas<t>geam();
- column major ordering is assumed.

The full example is contained in the SVD.cu file.

The first step towards SVD calculation in CUDA is to initialize the cuSOLVER, as is required for any other routine of the cuSOLVER library:

cusolverDnHandle_t solver_handle;cusolverDnCreate(&solver_handle);

The second step is to allocate the buffer space required by the SVD calculation routine:

cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));double *work; gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));

The buffer space dimension is calculated by cusolverDnDgesvd_bufferSize.

The third step is the final SVD calculation:

cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));int devInfo_h = 0; gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost)); if (devInfo_h != 0) std::cout << "Unsuccessful SVD execution\n\n";

Note that cusolverDnDgesvd returns the matrix V transposed. Also, U is the matrix of the *left* singular vectors, namely, the singular vectors spanning the output space of the matrix, while V is the matrix of the *right* singular vectors, namely, the singular vectors spanning the input space of the matrix.

For the example at SVD.cu, the output is:

Singular valuesd_S[0] = 314.891390715956 d_S[1] = 14.4118456776669 d_S[2] = 0.842607258487678 d_S[3] = 0.0277423461350813 d_S[4] = 0.000710344049837318 U[0,0]=-0.026900 U[1,0]=-0.043393 U[2,0]=-0.091636 U[3,0]=-0.181098 U[4,0]=-0.319440 U[5,0]=-0.513289 U[6,0]=-0.768565 U[0,1]=-0.366199 U[1,1]=-0.428964 U[2,1]=-0.477524 U[3,1]=-0.454763 U[4,1]=-0.323752 U[5,1]=-0.055840 U[6,1]=0.372983 U[0,2]=0.747999 U[1,2]=0.316640 U[2,2]=-0.072851 U[3,2]=-0.312084 U[4,2]=-0.353727 U[5,2]=-0.162134 U[6,2]=0.293467 U[0,3]=-0.530903 U[1,3]=0.565260 U[2,3]=0.392143 U[3,3]=-0.039990 U[4,3]=-0.324668 U[5,3]=-0.264268 U[6,3]=0.260770 U[0,4]=0.152553 U[1,4]=-0.590073 U[2,4]=0.467756 U[3,4]=0.350750 U[4,4]=-0.204309 U[5,4]=-0.423010 U[6,4]=0.256984 U[0,5]=0.007732 U[1,5]=-0.054514 U[2,5]=0.051584 U[3,5]=0.281387 U[4,5]=-0.717133 U[5,5]=0.607745 U[6,5]=-0.177468 U[0,6]=-0.021961 U[1,6]=0.207773 U[2,6]=-0.618898 U[3,6]=0.677639 U[4,6]=-0.081169 U[5,6]=-0.298327 U[6,6]=0.136131 V[0,0]=-0.349562 V[0,1]=-0.395802 V[0,2]=-0.442097 V[0,3]=-0.488660 V[0,4]=-0.535638 V[1,0]=0.637619 V[1,1]=0.405149 V[1,2]=0.114078 V[1,3]=-0.224424 V[1,4]=-0.604909 V[2,0]=0.635177 V[2,1]=-0.323686 V[2,2]=-0.505308 V[2,3]=-0.213766 V[2,4]=0.436743 V[3,0]=-0.258432 V[3,1]=0.705292 V[3,2]=-0.279236 V[3,3]=-0.497617 V[3,4]=0.331934 V[4,0]=0.031788 V[4,1]=-0.277461 V[4,2]=0.676925 V[4,3]=-0.646162 V[4,4]=0.215062

These results can be compared with the following simple Matlab code:

clear allclose allclc Nrows = 7; Ncols = 5; A = zeros(Nrows, Ncols); for j = 0 : Nrows - 1 for i = 0 : Ncols - 1 A(j + 1, i + 1) = (i + j * j) * sqrt(i + j); end end [U, S, V] = svd(A);

which outputs:

314.8914 14.4118 0.8426 0.0277 0.0007U =-0.0269 0.3662 -0.7480 0.5309 0.1526 0.0076 -0.0220 -0.0434 0.4290 -0.3166 -0.5653 -0.5901 -0.0528 0.2082 -0.0916 0.4775 0.0729 -0.3921 0.4678 0.0466 -0.6193 -0.1811 0.4548 0.3121 0.0400 0.3508 0.2868 0.6754 -0.3194 0.3238 0.3537 0.3247 -0.2043 -0.7178 -0.0754 -0.5133 0.0558 0.1621 0.2643 -0.4230 0.6053 -0.3032 -0.7686 -0.3730 -0.2935 -0.2608 0.2570 -0.1764 0.1375V =-0.3496 -0.6376 -0.6352 0.2584 0.0318 -0.3958 -0.4051 0.3237 -0.7053 -0.2775 -0.4421 -0.1141 0.5053 0.2792 0.6769 -0.4887 0.2244 0.2138 0.4976 -0.6462 -0.5356 0.6049 -0.4367 -0.3319 0.2151

As it can be seen, the CUDA and Matlab results coincide for the singular values, but not for the singular vectors. However, the singular vectors span the same spaces.

When I run this code, the performance’s not as good as expected. After profiling it with nvprof, I found out that there’re so many HtoD and DtoH data transfer (with chuck size of only few bytes), thus killing the performance. Do you have any insight into the reason for this behavior? I used CUDA 8 on a K40m

Hi, the SVD routine from cuSOLVER is known not to have good performance, especially as compared to MKL. This is recognized by Joe Eaton, manager of cuSOLVER library in this post: https://devtalk.nvidia.com/default/topic/830894/cusolverdncgesvd-performance-vs-mkl/. If you need better performance, you have two possibilities: Using CULA or using MAGMA. The former library is not for free, while the latter does. We have recently used MAGMA’s SVD routines and experienced very good performance against MKL.

which sentences did U use for compiling an executing this code???

i don’t know how