Orange Owl Solutions introduces Bluebird, a new library for rapidly coding scientific computing applications on GPUs and CPUs.
Thanks to the power of CUDA/C++ metaprogramming, you will be able to develop high-performance solutions in an easy and intuitive way.

If you need help, or want to ask questions, suggest future features or report bugs, you can reach us by e-mail or through the contact form.

Bluebird Library (286 downloads)

Main features

The main features of the current beta version (0.5) are:

  • C++/CUDA Metaprogramming
  • Simple (Matlab/Octave-like) ways to manage vectors and matrices
  • Stand-alone components for Complex Type management
  • Peer-to-Peer (P2P) communication between GPUs
  • Free demos and examples

We are working to release an extensive function library (e.g., interpolation, special functions, …) in upcoming versions.

Introduction by examples: Vector Sum

As a first example to introduce the Bluebird library (BB), we will refer to the Vector Sum example from the book CUDA by Example by Sanders and Kandrot.
Below are the CPU and GPU codes that sum two vectors a and b into the resulting vector c:

Vector Sum CPU (C++)

#include <stdio.h>

#define N 10

// Sum the vectors 'a' and 'b' element by element into 'c'
void add( int *a, int *b, int *c ) {
	int tid = 0;
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid += 1;
	}
}

int main( void ) {
	int a[N], b[N], c[N];

	// fill the arrays 'a' and 'b'
	for (int i=0; i<N; i++) {
		a[i] = -i;
		b[i] = i*i;
	}

	// add routine
	add( a, b, c );

	// display the results
	for (int i=0; i<N; i++) {
		printf( "%d + %d = %d\n", a[i], b[i], c[i] );
	}

	return 0;
}

Vector Sum GPU (CUDA)

#include <cuda.h>
#include <stdio.h>

#define N 10

// Each block handles one element: the block index selects the element
__global__ void addGPU( int *a, int *b, int *c ) {
	int tid = blockIdx.x;
	if (tid < N) c[tid] = a[tid] + b[tid];
}

int main( void ) {

	int a[N], b[N], c[N];
	int *dev_a, *dev_b, *dev_c;

	// allocate the memory on the GPU
	cudaMalloc( (void**)&dev_a, N * sizeof(int) );
	// ... same for b and c

	// fill the arrays 'a' and 'b' as CPU

	// Copy the arrays CPU->GPU
	cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
	// ... same for b

	// GPU Add kernel: N blocks of one thread each
	addGPU<<<N,1>>>( dev_a, dev_b, dev_c );

	// Copy the array 'c' GPU->CPU
	cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

	// Free the GPU memory
	cudaFree(dev_a);
	// ... same for b and c

	// display the results as CPU

	return 0;
}


With the Bluebird library, the code performing the vector addition on the GPU is considerably simpler than the CUDA version (note: using Hmatrix in place of Dmatrix makes the code run on the CPU):

#include "BB.h"
#define N 10

int main( void ) {

	Dmatrix<int> a(N,1);  //'a' is on GPU
	Dmatrix<int> b(N,1);  //'b' is on GPU
	Dmatrix<int> c(N,1);  //'c' is on GPU

	a=EqSpace<int>(0,-(N-1),N);                    //a(i)=-i (initialized directly on GPU!)
	b=EqSpace<int>(0,N-1,N)*EqSpace<int>(0,N-1,N); //b(i)=i*i (initialized directly on GPU!)

	// add routine
	c=a+b;

	// display the results
	for (int i=0; i<N; i++) {
		printf( "%d + %d = %d\n", a(i), b(i), c(i) );
	}

	return 0;
}


Let's introduce some of the main features of BB by following the main steps of the codes shown above.

Introduction by examples: Matrix Declaration

In both the CPU and GPU versions, the first task is to declare the vectors a, b and c. In the GPU case, we must also allocate memory on the GPU with the cudaMalloc function.

With the Bluebird library, each matrix (or array) can be easily declared and allocated (on the host or on the device) with the following syntax:

// 'x' is a column vector of N int elements allocated on the CPU (Host --> H)
Hmatrix<int> x(N,1);

// 'y' is an M by N matrix of double elements allocated on the GPU (Device --> D)
Dmatrix<double> y(M,N);

Note: declaring a matrix with Dmatrix also allocates its memory on the GPU.
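
The idea that a declaration also performs the allocation can be sketched in plain C++. The HostMatrix class below is a hypothetical stand-in written for this illustration, not Bluebird's Hmatrix or Dmatrix. It only assumes that storage is acquired in the constructor and released in the destructor; a device-side counterpart would presumably wrap cudaMalloc and cudaFree in the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a host-side matrix; NOT the Bluebird API.
// Storage is allocated when the object is declared and released
// automatically when it goes out of scope (RAII).
template <typename T>
class HostMatrix {
public:
    // Allocation happens here; elements are value-initialized (zero for int).
    HostMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    std::size_t size() const { return data_.size(); }

    // Linear element access, mirroring the a(i) syntax used in the text.
    T& operator()(std::size_t i) {
        assert(i < data_.size());
        return data_[i];
    }
    const T& operator()(std::size_t i) const {
        assert(i < data_.size());
        return data_[i];
    }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;  // owns the memory; freed by the destructor
};
```

With this sketch, HostMatrix<int> x(N,1); both declares and allocates a column vector of N elements, and no explicit free call is needed.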

Accordingly, for the vector addition example, the following declaration lines should be used:

Dmatrix<int> a(N,1);

Dmatrix<int> b(N,1);

Dmatrix<int> c(N,1);


Introduction by examples: Initialize the vectors a and b

In the previous codes, the vectors a and b are initialized on the CPU within a for loop.
With the library, a more compact initialization can be used, and it is carried out automatically on the GPU or on the CPU according to how the variables involved are declared.
In other words, if the vectors a and b are declared as Dmatrix, the initialization (see the following lines) is performed on the GPU, while if they are declared as Hmatrix, it is performed on the CPU.
To initialize the vector a, the EqSpace command is used:

EqSpace<type>(start,stop,K);

This command generates a vector of K equally spaced elements from start to stop.
As with the declaration, the element type (int, double, …) must be specified in the type field.
The initialization thus takes only two lines (same command for CPU or GPU):

a=EqSpace<int>(0,-(N-1),N);             //a(i)=-i

b=EqSpace<int>(0,N-1,N)*EqSpace<int>(0,N-1,N); //b(i)=i*i

When using Dmatrix, the data are initialized directly on the GPU (without using the cudaMemcpy routine).
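
To make the arithmetic behind the two initialization lines concrete, here is a minimal host-only sketch of an equally spaced generator. The eq_space helper is hypothetical and written for this illustration; it is not Bluebird's EqSpace, and it assumes both end points are included, which is consistent with EqSpace<int>(0,-(N-1),N) producing a(i)=-i.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper written for this illustration; NOT Bluebird's EqSpace.
// Returns `count` equally spaced values from `start` to `stop`, both included.
template <typename T>
std::vector<T> eq_space(T start, T stop, std::size_t count) {
    assert(count >= 2);  // at least the two end points
    std::vector<T> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Multiply before dividing so that integer types stay exact
        // whenever (stop - start) is a multiple of (count - 1).
        out[i] = start + (stop - start) * static_cast<T>(i) / static_cast<T>(count - 1);
    }
    return out;
}
```

With N equal to 10, eq_space<int>(0, -9, 10) yields 0, -1, ..., -9, and squaring each element of eq_space<int>(0, 9, 10) reproduces b(i)=i*i.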

Introduction by examples: Addition

One line (for CPU as well as for GPU):

c=a+b;

Nothing could be easier!
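
A one-line sum of the form c = a + b is commonly made efficient with expression templates, a standard C++ metaprogramming technique. The host-only sketch below uses hypothetical names (Vec, Sum) and shows the general idea only; it is an assumption about the kind of approach involved, not a description of Bluebird's internals. It handles a single sum of two vectors and requires C++14 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; NOT the Bluebird implementation.

// Lazy node representing "lhs + rhs"; nothing is computed until it is indexed.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    auto operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename T>
class Vec {
public:
    explicit Vec(std::size_t n) : data_(n) {}

    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning an expression evaluates it element by element in one loop,
    // without building a temporary vector for the intermediate result.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<T> data_;
};

// operator+ only records references to its operands.
template <typename T>
Sum<Vec<T>, Vec<T>> operator+(const Vec<T>& a, const Vec<T>& b) {
    assert(a.size() == b.size());
    return {a, b};
}
```

In a GPU library the loop inside the assignment would be replaced by a kernel launch, which is how a single user-level statement can map to a single kernel.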


Introduction by examples: Streams

In the downloadable files you will find a port of the CUDA SDK simpleStreams example to the syntax of the Bluebird library.
The aim is to show how streams and pinned memory allow kernels and memory transfers to execute concurrently.
The results on a GeForce GT 540M are the following:

Memcopy async only (1 stream): 11.84 ms
Kernel only (1 stream): 11.19 ms
Kernel with memory transfers (1 stream): 23.04 ms
Kernel without memory transfers (4 streams): 12.08 ms
Kernel with memory transfers (4 streams), approach #1: 12.09 ms
Kernel with memory transfers (4 streams), approach #2: 12.30 ms

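These results show that, with 4 streams, the overall execution time is essentially the same with memory transfers (12.09 ms and 12.30 ms) as without them (12.08 ms), whereas with a single stream the transfers roughly double it (23.04 ms).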
The overlap between memory transfers and execution is visible from the following timeline produced by the Visual Profiler.
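
The comparison can be checked with simple arithmetic. The sketch below uses only the two single-stream timings reported above: it estimates the cost without overlap as the sum of the two phases, and an idealized best case with perfect overlap as the longer of the two. The function names are made up for this illustration, and perfect overlap is a simplifying assumption, not a measurement.

```cpp
#include <algorithm>
#include <cassert>

// Timings in milliseconds, copied from the results listed above.
const double copy_only_ms   = 11.84;  // Memcopy async only (1 stream)
const double kernel_only_ms = 11.19;  // Kernel only (1 stream)

// Without overlap the two phases run back to back.
double serial_estimate_ms() {
    return copy_only_ms + kernel_only_ms;
}

// With perfect overlap the longer phase hides the shorter one.
// This is an idealized lower bound, not a measured value.
double overlapped_lower_bound_ms() {
    return std::max(copy_only_ms, kernel_only_ms);
}
```

The sum is 23.03 ms, close to the 23.04 ms measured with one stream, while the longer phase alone takes 11.84 ms, so the 12.09 ms and 12.30 ms measured with 4 streams sit near the idealized bound.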


Final notes

Printing GPU matrix elements is as easy as printing CPU matrix elements.
You can directly access GPU or CPU matrix elements with the a(i) syntax (see the use of cout in the Bluebird library example code). Alternatively, you can access sub-vectors or sub-matrices as

cout << a(Range(0,9)) << endl;

which will show the elements of a from #0 to #9.
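
The inclusive bounds of such a range (both #0 and #9 are shown, ten elements in total) can be illustrated with a small host-only helper. format_range is a hypothetical function written for this sketch; it is not Bluebird's Range, and the space-separated output format is an assumption, since the library's actual print layout is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration; NOT Bluebird's Range.
// Formats the elements first..last (both inclusive), separated by spaces.
template <typename T>
std::string format_range(const std::vector<T>& v, std::size_t first, std::size_t last) {
    assert(first <= last && last < v.size());
    std::ostringstream os;
    for (std::size_t i = first; i <= last; ++i) {
        if (i != first) os << ' ';
        os << v[i];
    }
    return os.str();
}
```

Passing first equal to 0 and last equal to 9 formats exactly ten elements, matching the "from #0 to #9" behavior described above.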