Autotunable Multi-Backend Skeleton Programming Framework for Multicore CPU and Multi-GPU Systems

New release: SkePU v1.2 (13 May 2015)

FAQ Overview Download Publications Code Examples Applications Features License Ongoing Work Contact Acknowledgments

Skeleton programming is an approach where an application is written with the help of "skeletons". A skeleton is a pre-defined, generic component such as map, reduce, scan, farm, pipeline etc. that implements a common specific pattern of computation and data dependence, and that can be customized with (sequential) user-defined code parameters.
Skeletons provide a high degree of abstraction and portability with a quasi-sequential programming interface, as their implementations encapsulate all low-level and platform-specific details such as parallelization, synchronization, communication, memory management, accelerator usage and other optimizations.

SkePU poster SkePU is an open-source skeleton programming framework for multicore CPUs and multi-GPU systems.
It is a C++ template library with six data-parallel and one task-parallel skeletons, two generic container types, and support for execution on multi-GPU systems both with CUDA and OpenCL.


Main features:

SkePU comes in two different distributions:
(1) as a stand-alone version which includes an off-line autotuning framework preparing for context-aware dynamic selection of the expected fastest implementation variant at each skeleton call, and
(2) as a version that is integrated with the StarPU runtime system. The latter version provides support for hybrid CPU+GPU execution, performance-aware dynamic scheduling and load balancing.

Source Code and Documentation

Current version: SkePU v1.2 (released 13/5/2015, latest patch 25/5/2015)

  1. Standalone SkePU (Last updated 13/5/2015)

    Download source-code

    Major new features:

    • Streaming support for CUDA GPU execution of most skeletons,
    • MultiVector container for passing an arbitrary number of operands to MapArray,
    • new example applications (e.g. Median Filtering),
    • and a number of fixes e.g. in memory management.

    See also the html documentation generated by doxygen.

  2. SkePU with StarPU integration, v0.8.1 (Last updated 13/5/2015)

    Download source-code.

    • Updated and tested for StarPU 1.1.4 (latest stable StarPU version) with CUDA/nvcc 5.0 and gcc 4.8.1.

    See also the html documentation generated by doxygen.

Previous Releases (Archive)








Code Examples

SkePU is a C++ template library targeting GPU-based systems that provides a higher abstraction level to the application programmer. The SkePU code is concise, elegant and efficient. In the following example, a dot product calculation of two input vectors is shown using the MapReduce skeleton available in the SkePU library. In the code, a MapReduce skeleton instance (dotProduct) is created which maps two input vectors with mult_f and then reduces the result with plus_f, thus instantiating a dot product function. Behind the scene, the computation can either run on a sequential CPU, multi-core CPUs, single or multiple GPUs based on the execution configuration and platform used for the execution.

#include < iostream >

#include "skepu/vector.h"
#include "skepu/mapreduce.h"

BINARY_FUNC(plus_f, double, a, b,
    return a+b;

BINARY_FUNC(mult_f, double, a, b,
    return a*b;

int main()
    skepu::MapReduce <mult_f, plus_f> dotProduct(new mult_f, new plus_f);

    skepu::Vector <double> v1(500,4);
    skepu::Vector <double> v2(500,2);

    double r = dotProduct(v1,v2);
    std::cout <<"Result: " << r << "\n";
    return 0;
// Output
// Result: 4000

More SkePU Code Examples


Several test applications have been developed using the SkePU skeletons, including

Their source code is included in the SkePU distribution.

Some Features

This section lists some results of tests conducted with several applications, highlighting different performance-related aspects. For a comprehensive description and evaluation of SkePU features and implementation, see e.g. Chapters 3 and 4 of Usman Dastgeers PhD thesis.

Multi-GPU Execution

SkePU supports multi-GPU executions using CUDA and OpenCL, even with early CUDA versions.
With CUDA before version 4.0, the multi-GPU execution was rather inefficient due to the threading overhead on the host side. With CUDA 4.0 and later, it is possible to use all GPUs in the system concurrently from a single host (CPU) thread. This makes multi-GPU executions using CUDA a viable option for many applications. In SkePU, support for multi-GPU executions is changed when using CUDA 4.0 to use single host thread. The diagram below compares execution on one and two GPUs with SkePU multi-GPU execution using CUDA 4.0.

Hybrid execution: Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes

Auto-tuned context-aware back-end selection for skeleton calls

The stand-alone version of SkePU applies an off-line (deployment-time) machine learning technique to predict at runtime the fastest back-end for a skeleton call, depending on the call context (in particular, operand sizes). This includes the choice between a CPU implementation (sequential or with multiple OpenMP threads), or between CUDA resp. OpenCL with one or several GPUs. Even the number of GPU threads and thread blocks can be tuned. See the documentation and publications for further details.

The StarPU-integrated version of SkePU instead delegates the back-end selection to the StarPU runtime system's built-in dynamic performance modeling and tuning mechanism.

Hybrid Execution

When using SkePU with StarPU support, SkePU can split the work of one or more skeleton executions into multiple tasks and use multiple computing devices (CPUs/GPUs) in an efficient way. The diagram below shows a Coulombic potential grid application execution on a hybrid platform (CPUs and GPU) for different matrix sizes.

Hybrid execution: Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes
Overhead of memory transfer, in comparison to computations

Smart Containers

The diagram to the right shows a breakdown of the execution time for different SkePU single-skeleton computations coded in OpenCL, showing PCIe memory transfer and kernel computation times. It highlights the overhead for transferring data to (red) and from (green) GPUs in relation to the kernel's computational work (blue).

In order to eliminate unnecessary data transfers across multiple subsequent skeleton calls, SkePU provides "smart containers" for passing operands in generic STL-like data structures such as Vector and Matrix. A smart container internally performs software caching of recently accessed elements in various device memories and reusing them in subsequent calls on the same device where applicable, resulting in a run-time optimization of operand communication. In particular, it implements a "lazy memory copying" technique, transferring written elements back from device memory only if accessed by the CPU after the call.

The smart container concept and implementation has been revised in SkePU v1.1 (2014) compared to earlier versions. The Vector and Matrix smart containers in SkePU v1.1 internally implement a variant of the MSI coherence protocol, providing sequential consistency (thus no need to explicitly flush() device copies any more in a multi-GPU scenario). For multi-GPU usage, it also supports direct GPU-GPU transfer of coherence messages where available, and uses lazy deallocation of device copies to reduce memory management overhead.

Smart containers can lead to a significant performance gain over "normal" containers, especially for iterative computations such as Nbody simulation or SPH where we observed speedups of up to 3 orders of magnitude by using smart containers instead of naive operand data transfer before and after each kernel invocation. Some speedup results for a system with 2 GPUs, averaged over many runs with different problem sizes, are shown below to the right. For the details, see the documentation and publications.

Overhead of memory transfer, in comparison to computations

Performance Comparison

The diagram below shows the performance of the dotProduct example computed using different backends, highlighting performance differences between different back-ends. Moreover, it compares with CUBLAS, as the Multi_OpenCL MapReduce implementation of SkePU outperforms all others in this case.

DotProduct using MapReduce SkePU skeleton in comparison to cublas implementation

For more results, see the publications section.

Software License

SkePU is licensed under the GNU General Public License as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.

Ongoing work

SkePU is a work in progress. Future work includes adding support for more skeletons and containers, e.g. for sparse matrix operations; for further task-parallel skeletons; and for other types of target architectures. For instance, there exists an experimental prototype with MPI backends, allowing to run SkePU programs on multiple nodes of a MPI cluster without modifying the source code (not included in the above public distribution).

If you would like to contribute, please let us know.


For reporting bugs, please email to "<firstname> DOT <lastname> AT liu DOT se".


This work was partly funded by the EU FP7 projects PEPPHER and EXCESS, and by SeRC project OpCoReS.

Previous major contributors to SkePU include Johan Enmyren and Usman Dastgeer.
Streaming support for CUDA in SkePU 1.2 has been contributed by Claudio Parisi from University of Pisa, Italy, in a recent cooperation with the FastFlow project.
SkePU example programs in the public distribution have been contributed by Johan Enmyren, Usman Dastgeer, Mudassar Majeed and Lukas Gillsjö.
Experimental implementations of SkePU for other target platforms (not part of the public distribution) have been contributed e.g. by Mudassar Majeed and Rosandra Cuello.