SkePU Skeleton Programming Framework for Multicore CPU and Multi-GPU Systems

SkePU FAQ

Skeleton programming is an approach where an application is written with the help of "skeletons". A skeleton models a common computation and coordination pattern of parallel applications as a pre-defined generic component that can be parameterized with user computations.

SkePU is such a skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel skeletons and one task-parallel skeleton, two container types, and support for execution on multi-GPU systems with both CUDA and OpenCL. Recently, support for hybrid execution, performance-aware dynamic scheduling, and load balancing has been added to SkePU by implementing a backend for the StarPU runtime system.

Source code and documentation

Current release

  1. Standalone SkePU: Version 1.1 (last updated 16/05/2014): Available for download via the following link: download source code. It contains an auto-tuning mechanism (see the APPT 2013 paper) as well as a new memory management mechanism for the Vector and Matrix containers. HTML documentation, generated with Doxygen.
  2. SkePU with StarPU integration: Version 0.8 (last updated 06/11/2012): Available for download via the following link: download source code. It contains seven data-parallel skeletons and one task-parallel (farm) skeleton for vector and matrix operands, with multi-GPU and hybrid execution support. It contains several enhancements over the previous release; see the 'CHANGES' file for more information. HTML documentation, generated with Doxygen. Tested with StarPU 1.0.4, CUDA/nvcc 4.2, and GCC 4.7.1.

Previous releases

  1. Version 0.6: The first public release of SkePU. Available for download via the following link: download source code. It contains seven data-parallel skeletons for vector operands with multi-GPU support. See the HTML documentation, generated with Doxygen.
  2. Version 0.7: Available for download via the following link: download source code. It contains seven data-parallel skeletons for vector and (partially supported) matrix operands, with efficient multi-GPU support on CUDA using a single host thread. See the HTML documentation, generated with Doxygen.
  3. Version 1.0: Available for download via the following link: download source code. It contains seven data-parallel skeletons for vector, dense matrix, and sparse matrix operands, with efficient multi-GPU support on CUDA and OpenCL using a single host thread. It contains several enhancements over the previous release; see the 'CHANGES' file for more information. See the HTML documentation, generated with Doxygen.
  4. Version 0.7 with StarPU integration: SkePU integrated with StarPU can be downloaded via the following link: download source code. It contains seven data-parallel skeletons and one task-parallel (farm) skeleton for vector and matrix operands, with multi-GPU and hybrid execution support. See the HTML documentation, generated with Doxygen.

Publications

Publications on SkePU span the years 2010 to 2014.

A code example

SkePU is a C++ template library targeting GPU-based systems that provides a higher abstraction level to the application programmer. SkePU code is concise, elegant, and efficient. The following example computes the dot product of two input vectors using the MapReduce skeleton from the SkePU library. In the code, a MapReduce skeleton instance (dotProduct) is created which maps pairs of elements from the two input vectors with mult_f and then reduces the intermediate results with plus_f, producing the dot product. Behind the scenes, the computation can run on a sequential CPU, multicore CPUs, or one or more GPUs, depending on the execution configuration and the platform used.
#include <iostream>

#include "skepu/vector.h"
#include "skepu/mapreduce.h"

// User functions, defined with SkePU's macro interface
BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

BINARY_FUNC(mult_f, double, a, b,
    return a*b;
)

int main()
{
    // Skeleton instance: element-wise multiplication, then summation
    skepu::MapReduce<mult_f, plus_f> dotProduct(new mult_f, new plus_f);

    skepu::Vector<double> v1(500, 4);  // 500 elements, each initialized to 4
    skepu::Vector<double> v2(500, 2);  // 500 elements, each initialized to 2

    double r = dotProduct(v1, v2);     // 500 * (4*2) = 4000
    
    std::cout <<"Result: " << r << "\n";
    
    return 0;
}
// Output
// Result: 4000
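
For comparison, a single skeleton can also be used on its own. The following sketch squares every element of a vector with the Map skeleton; it assumes a UNARY_FUNC macro and a skepu::Map class analogous to the BINARY_FUNC/MapReduce interface above, so treat the exact signatures as illustrative.

#include <iostream>

#include "skepu/vector.h"
#include "skepu/map.h"

// User function with one parameter (assumed macro, analogous to BINARY_FUNC)
UNARY_FUNC(square_f, double, a,
    return a*a;
)

int main()
{
    skepu::Map<square_f> square(new square_f);

    skepu::Vector<double> v(10, 3);   // 10 elements, each initialized to 3
    skepu::Vector<double> r(10);      // output vector

    square(v, r);                     // r[i] = v[i]*v[i], on any backend

    std::cout << "r[0]: " << r[0] << "\n";
    return 0;
}
// Output
// r[0]: 9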

Evaluation

Several applications have been developed using the SkePU skeletons, including a Runge-Kutta ODE solver, separable and non-separable image convolution filters, Successive Over-Relaxation (SOR), Coulombic potential, NBody, LU Decomposition, and Mandelbrot fractals. This section lists some results of tests conducted with several of these applications, highlighting different performance-related aspects.

Multi-GPU execution

SkePU supports multi-GPU execution using both CUDA and OpenCL. However, with CUDA versions before 4.0, multi-GPU execution was rather inefficient due to the threading overhead on the host side. With CUDA 4.0, it is possible to use all GPUs in the system concurrently from a single host (CPU) thread, which makes multi-GPU execution with CUDA a viable option for many applications. SkePU's multi-GPU support has accordingly been changed to use a single host thread when CUDA 4.0 is used. The diagram below compares execution on one and two GPUs using SkePU's multi-GPU support with CUDA 4.0.

Multi-GPU execution: comparison of execution on one and two GPUs with CUDA 4.0
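
As a rough sketch of how such a run is set up (the configuration macros below are assumptions, not taken from this page): the backend and the number of GPUs are chosen at compile time, while the skeleton call itself is unchanged.

// Sketch: compile-time configuration for a CUDA multi-GPU run.
// Assumed compile line:  nvcc -DSKEPU_CUDA -DSKEPU_NUMGPU=2 dotproduct.cu
//   SKEPU_CUDA   -- selects the CUDA backend (assumed macro name)
//   SKEPU_NUMGPU -- number of GPUs to use (assumed; 0 taken to mean "all")
#include <iostream>

#include "skepu/vector.h"
#include "skepu/mapreduce.h"

BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

BINARY_FUNC(mult_f, double, a, b,
    return a*b;
)

int main()
{
    skepu::MapReduce<mult_f, plus_f> dotProduct(new mult_f, new plus_f);

    skepu::Vector<double> v1(500, 4);
    skepu::Vector<double> v2(500, 2);

    // The call site is identical to the single-GPU case; the work is
    // split across the GPUs and the partial results are combined
    // internally, from a single host thread.
    std::cout << "Result: " << dotProduct(v1, v2) << "\n";
    return 0;
}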

Hybrid execution

When SkePU is used with StarPU support, the work of one or more skeleton executions can be split so that multiple computing devices (CPUs and GPUs) are used in an efficient way. The diagram below shows Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes.

Hybrid execution: Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes

Performance-aware dynamic scheduling

SkePU supports dynamic performance-aware scheduling when used with StarPU. The table below shows execution of iterative SOR under such performance-aware scheduling policies, where average speedups over different matrix sizes are reported with respect to the static scheduling (CUDA execution) policy. The results show that scheduling policies with data-prefetching support yield significant performance gains.

Data-aware scheduling policy
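
To give a concrete picture of how such a policy is typically chosen (the variable and policy name below come from StarPU itself and are assumptions with respect to this page): StarPU reads the STARPU_SCHED environment variable at initialization, and dmda is one of its performance-model-based policies with data prefetching.

#include <cstdlib>

int main()
{
    // Assumption: select StarPU's "dmda" policy (performance-model based,
    // with data prefetching) before the runtime initializes. Equivalently,
    // launch the program as:  STARPU_SCHED=dmda ./app
    setenv("STARPU_SCHED", "dmda", 1);

    // ... create SkePU skeleton instances and containers, then call them
    // as usual; the StarPU backend handles the scheduling decisions ...
    return 0;
}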

GPU communication overhead

The diagram below shows the time distribution of different operations (e.g., memory transfers and computation) when using OpenCL. It highlights the overhead of transferring data to and from GPUs. To mitigate this data-transfer problem, SkePU uses a "lazy memory copying" technique (see the documentation and publications for details).

Overhead of memory transfer, in comparison to computations
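
To illustrate the effect (a sketch of the observable behavior, not of SkePU's internal implementation; the Map and Reduce skeletons are assumed to follow the same interface as the examples above): consecutive skeleton calls on the same container can reuse data already resident on the device, and the copy back to host memory is deferred until an element is actually accessed there.

#include <iostream>

#include "skepu/vector.h"
#include "skepu/map.h"
#include "skepu/reduce.h"

// Assumed macro interface, as in the examples above
UNARY_FUNC(scale_f, double, a,
    return 2.0*a;
)

BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

int main()
{
    skepu::Map<scale_f> scale(new scale_f);
    skepu::Reduce<plus_f> sum(new plus_f);

    skepu::Vector<double> v(1000, 1);
    skepu::Vector<double> r(1000);

    scale(v, r);          // uploads v, computes r on the device
    double s = sum(r);    // r is assumed to still reside on the device,
                          // so no host<->device transfer is needed here

    std::cout << s << "\n";       // 2000
    std::cout << r[0] << "\n";    // reading r on the host triggers the
                                  // deferred (lazy) copy back; prints 2
    return 0;
}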

Performance comparison

The diagram below shows the performance of the dotProduct example with different backends, highlighting the performance differences between them. It also includes a comparison with CUBLAS; the Multi_OpenCL MapReduce implementation of SkePU outperforms all others in this case.

DotProduct using the SkePU MapReduce skeleton, in comparison to a CUBLAS implementation
For more results, see the publications section.

Software License

SkePU is licensed under the GNU General Public License as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.

Ongoing work

SkePU is a work in progress. Future work includes adding support for more skeletons for dense and possibly sparse matrix operations, as well as further task-parallel skeletons.
Contact: Usman Dastgeer, Johan Enmyren, Prof. Christoph Kessler. For reporting bugs, please email "<firstname> DOT <lastname> AT liu DOT se".
This work was partly funded by the EU FP7 project PEPPHER.