The Singular Value Decomposition (SVD) of a matrix underlies many computations and approaches in applied science. One example is the regularized solution of linear systems of equations. Another is Principal Component Analysis.
Applications that need the SVD often deal with large matrices and/or compute the SVD repeatedly inside an iterative process.
Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVE...

# CUDA Libraries

# CUDA mex function using real data residing on the host and producing real results on the host

In the CUDA_mex_host_to_device GitHub directory, we provide an example of how to create a mex function that executes on the GPU when the real input data reside on the host and the final results are returned to the host.
The first step is to recover the pointer to the first element of the real data from the Matlab input array/matrix:
double *h_input = mxGetPr(prhs[0]);
We can also recover the number of elements of the input variable (which can also be a matrix) as:
in...

# Count the occurrences of numbers in a CUDA array

We compare two approaches to counting the occurrences of numbers in a CUDA array.
Both approaches use CUDA Thrust:
Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
Using thrust::unique_copy and thrust::upper_bound.
A fully worked example is available on our GitHub page.
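The idea shared by both approaches (sort the data, then locate the end of each run of equal values with an upper-bound search) can be sketched on the host. This is only an illustration under stated assumptions: it uses std::sort, std::unique_copy and std::upper_bound from the C++ standard library as stand-ins for their Thrust counterparts, and the function name count_occurrences is hypothetical, not taken from the GitHub example.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: returns (value, count) pairs in ascending order of value.
// Host-only sketch; the standard algorithms stand in for the Thrust ones.
std::vector<std::pair<int, std::size_t>> count_occurrences(std::vector<int> data) {
    // Sorting groups equal values into contiguous runs.
    std::sort(data.begin(), data.end());

    // Distinct values, in sorted order.
    std::vector<int> values;
    std::unique_copy(data.begin(), data.end(), std::back_inserter(values));

    // upper_bound gives the end of each run, i.e. a cumulative count;
    // the difference between consecutive ends is the count of that value.
    std::vector<std::pair<int, std::size_t>> result;
    std::size_t previous_end = 0;
    for (int v : values) {
        const std::size_t end = static_cast<std::size_t>(
            std::upper_bound(data.begin(), data.end(), v) - data.begin());
        result.emplace_back(v, end - previous_end);
        previous_end = end;
    }
    return result;
}
```

For example, calling count_occurrences on {3, 1, 3, 2, 1, 3} yields the pairs (1, 2), (2, 1) and (3, 3).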
The first approach proved to be the faster. On an NVIDIA GTX 960 card, we measured the following timings for N = 1048576 array elements:
First ap...

# Sorting 2 or 3 arrays by key with CUDA Thrust

We have compared two approaches to sorting several arrays by the same key. One approach uses thrust::zip_iterator and the other thrust::gather.
We have tested them when sorting two arrays and when sorting three arrays. In both cases, the approach using thrust::gather proved to be faster.
The full code is available on our GitHub website:
2 Arrays solution
3 Arrays solution
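The gather-based idea can be sketched on the host as follows: sort an index permutation by key once, then apply that permutation to every array. This is a minimal sketch, not the code from the GitHub solutions; it uses the C++ standard library instead of Thrust, assumes all arrays have the same length as the keys, and the names gather and sort_two_by_key are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: out[i] = src[perm[i]], the same access pattern as a gather.
template <typename T>
std::vector<T> gather(const std::vector<std::size_t>& perm, const std::vector<T>& src) {
    std::vector<T> out(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i) out[i] = src[perm[i]];
    return out;
}

// Sorts keys in ascending order and reorders a and b with the same permutation.
// Assumes keys, a and b all have the same length.
void sort_two_by_key(std::vector<int>& keys, std::vector<float>& a, std::vector<float>& b) {
    // Start from the identity permutation 0, 1, 2, ...
    std::vector<std::size_t> perm(keys.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});

    // Sort the indices by the key they point to.
    std::stable_sort(perm.begin(), perm.end(),
                     [&keys](std::size_t i, std::size_t j) { return keys[i] < keys[j]; });

    // One gather per array; adding a third array costs one more gather.
    keys = gather(perm, keys);
    a = gather(perm, a);
    b = gather(perm, b);
}
```

With this layout the key comparison is paid once, and each additional array only adds one more gather pass.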
Some timing results follow (NVIDIA GTX 960 card):
Timing in the case of 2 arrays for...

# Singular values calculation only of a real matrix with CUDA

Besides the full SVD of a matrix (see "SVD of a real matrix"), cusolverDnSgesvd can also compute only the singular values of a matrix.
On our GitHub website, we provide sample code with two calls to cusolverDnSgesvd, one performing the singular values calculation only
cusolverDnSgesvd(solver_handle, 'N', 'N', M, N, d_A, M, d_S, NULL, M, NULL, N, work, work_size, NULL, devInfo)
and one performing the full SVD calculation
cusolverDnSgesvd(solver_handle, 'A', 'A', M, N, d_...

# Sorting by key with tuple key and customized comparison operator

Sorting an array by key is possible once an ordering is defined for the tuple key.
With CUDA Thrust, such an ordering can be defined by overloading the “<” comparison operator.
Accordingly, sorting by a tuple key with CUDA Thrust can be performed by a combination of thrust::sort_by_key, zip iterators and tuples.
Our GitHub web page contains an example of how this can be accomplished simply.
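As a host-side illustration of sorting by a tuple key under a user-defined ordering, the sketch below uses the C++ standard library in place of Thrust. The ordering itself (first component ascending, second component descending) and the names Key, KeyLess and sort_by_tuple_key are assumptions made for the example, not taken from the GitHub code.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <tuple>
#include <vector>

using Key = std::tuple<int, int>;

// Assumed custom ordering: first component ascending, ties broken by the
// second component in descending order.
struct KeyLess {
    bool operator()(const Key& l, const Key& r) const {
        if (std::get<0>(l) != std::get<0>(r)) return std::get<0>(l) < std::get<0>(r);
        return std::get<1>(l) > std::get<1>(r);
    }
};

// Sorts keys with KeyLess and moves each value along with its key.
// Assumes keys and values have the same length.
void sort_by_tuple_key(std::vector<Key>& keys, std::vector<float>& values) {
    std::vector<std::size_t> perm(keys.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});

    // Order the indices by comparing the tuple keys they refer to.
    KeyLess less;
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t i, std::size_t j) { return less(keys[i], keys[j]); });

    // Rebuild both arrays in the sorted order.
    std::vector<Key> sorted_keys(keys.size());
    std::vector<float> sorted_values(values.size());
    for (std::size_t i = 0; i < perm.size(); ++i) {
        sorted_keys[i] = keys[perm[i]];
        sorted_values[i] = values[perm[i]];
    }
    keys.swap(sorted_keys);
    values.swap(sorted_values);
}
```

Only the comparison functor encodes the ordering, so changing how tuple keys compare does not touch the sorting code.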

# Sorting tuples with CUDA Thrust

Sorting tuples is possible once an ordering is defined for the tuples.
With CUDA Thrust, such an ordering can be defined by overloading the “<” comparison operator. Accordingly, sorting tuples with CUDA Thrust can be performed by a combination of thrust::sort, zip iterators and tuples.
Our GitHub web page contains an example of how this can be accomplished simply.
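What a zip iterator provides, namely viewing separate arrays as one sequence of tuples, can be imitated on the host by zipping explicitly, sorting the tuples and unzipping. The sketch below does this with the C++ standard library rather than Thrust; the ordering and the names Pair, PairLess and sort_zipped are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

using Pair = std::tuple<int, float>;

// Assumed ordering: compare the int component first, break ties on the float.
struct PairLess {
    bool operator()(const Pair& l, const Pair& r) const {
        if (std::get<0>(l) != std::get<0>(r)) return std::get<0>(l) < std::get<0>(r);
        return std::get<1>(l) < std::get<1>(r);
    }
};

// Sorts the tuples (x[i], y[i]) and writes them back into x and y.
// Only the first min(x.size(), y.size()) elements are considered.
void sort_zipped(std::vector<int>& x, std::vector<float>& y) {
    const std::size_t n = std::min(x.size(), y.size());

    // "Zip": materialise the tuples (a zip iterator would avoid this copy).
    std::vector<Pair> zipped(n);
    for (std::size_t i = 0; i < n; ++i) zipped[i] = Pair(x[i], y[i]);

    std::sort(zipped.begin(), zipped.end(), PairLess());

    // "Unzip": scatter the sorted tuples back into the two arrays.
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::get<0>(zipped[i]);
        y[i] = std::get<1>(zipped[i]);
    }
}
```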

# CUDA Thrust saxpy with placeholders and lambda expressions

Saxpy, namely z = a * x + y, is a very common operation to be performed in scientific programming.
cuBLAS implements its own saxpy, but only in place, with the result overwriting y (z = y), so in some circumstances saxpy has to be implemented with a kernel function or with CUDA Thrust.
On our GitHub website, a fully worked example shows how to implement saxpy in CUDA using Thrust and, in particular, using the placeholder technique.
A fully worked example is also reported of how to implement saxpy ...
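The out-of-place form z = a * x + y maps naturally onto a binary transform with a lambda. The sketch below shows this on the host with std::transform, which has the same shape as thrust::transform; it is an illustration only, and the function name saxpy and its signature are assumptions rather than the code from the GitHub example.

```cpp
#include <algorithm>
#include <vector>

// Out-of-place saxpy: returns z with z[i] = a * x[i] + y[i].
// Assumes y has at least as many elements as x.
std::vector<float> saxpy(float a, const std::vector<float>& x, const std::vector<float>& y) {
    std::vector<float> z(x.size());
    // Binary transform: the lambda receives one element of x and one of y.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return z;
}
```

For example, saxpy(2.0f, {1, 2, 3}, {10, 20, 30}) returns {12, 24, 36}, leaving both x and y untouched.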

# Sorting many small “packed” arrays by key in CUDA

One problem of interest is that of extending the approach in "Sorting many small arrays by key in CUDA" to the case when multiple arrays must be ordered according to the same key.
Unfortunately, it is not possible to use cub::BlockRadixSort by "packing" the arrays using zip iterators and tuples. Accordingly, we have exploited a helper index approach.
A fully worked example is reported on our GitHub website.
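One way to read the helper index idea is: within each small array, sort an index by key instead of sorting the packed data, then use the sorted index to reorder every array. The host-only sketch below illustrates that reading with the C++ standard library; it does not use CUB and is not the code from the GitHub example. The function name sort_segments_by_key, the fixed segment length seg_len and the choice of two payload arrays are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sorts each consecutive segment of seg_len elements by key and reorders
// a and b in the same way. Assumes keys, a and b have the same length;
// a shorter final segment is handled as its own segment.
void sort_segments_by_key(std::vector<int>& keys, std::vector<float>& a,
                          std::vector<float>& b, std::size_t seg_len) {
    if (seg_len == 0) return;
    const std::size_t n = keys.size();
    std::vector<std::size_t> idx;

    for (std::size_t start = 0; start < n; start += seg_len) {
        const std::size_t len = std::min(seg_len, n - start);

        // Helper index: positions of this segment's elements.
        idx.resize(len);
        std::iota(idx.begin(), idx.end(), start);

        // Sort the helper index by key; the data are not moved yet.
        std::stable_sort(idx.begin(), idx.end(),
                         [&keys](std::size_t i, std::size_t j) { return keys[i] < keys[j]; });

        // Use the sorted index to reorder the key and every payload array.
        std::vector<int> k(len);
        std::vector<float> ta(len), tb(len);
        for (std::size_t p = 0; p < len; ++p) {
            k[p] = keys[idx[p]];
            ta[p] = a[idx[p]];
            tb[p] = b[idx[p]];
        }
        for (std::size_t p = 0; p < len; ++p) {
            keys[start + p] = k[p];
            a[start + p] = ta[p];
            b[start + p] = tb[p];
        }
    }
}
```

In this sketch each segment is processed independently of the others, which is the property a per-block GPU sort relies on.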

# Sorting many small arrays by key in CUDA

Many applications need to sort many small arrays by key in CUDA.
CUB offers a possible solution to this problem. On our GitHub website, we report an example that can be reused for this purpose.
This example was built around the previously posted "Sorting many small arrays in CUDA".
