SkePU Skeleton Programming Framework for Multicore CPU and Multi-GPU Systems

SkePU FAQs

Skeleton programming is an approach where an application is written with the help of "skeletons". A skeleton models common computation and coordination patterns of parallel applications as a pre-defined generic component that can be parameterized with user computations.

SkePU is such a skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel and one task-parallel skeletons, two container types, and support for execution on multi-GPU systems with both CUDA and OpenCL. Recently, support for hybrid execution, performance-aware dynamic scheduling, and load balancing has been added to SkePU by implementing a backend for the StarPU runtime system.

Source code and documentation

Current release

  1. Original SkePU: Version 1.0 (Last Updated=04/08/2012): Available for download via the following link: download source-code. It contains seven data-parallel skeletons for vector, dense and sparse matrix operands, with efficient multi-GPU support on CUDA and OpenCL using a single host thread. Contains several enhancements over the previous release; see the 'CHANGES' file for more information. HTML documentation, generated with Doxygen.
  2. SkePU with StarPU integration: Version 0.8 (Last Updated=06/11/2012): Available for download via the following link: download source-code. It contains seven data-parallel and one task-parallel (farm) skeletons for vector and matrix operands, with multi-GPU and hybrid execution support. Contains several enhancements over the previous release; see the 'CHANGES' file for more information. HTML documentation, generated with Doxygen. Tested with StarPU 1.0.4, CUDA/nvcc 4.2, and GCC 4.7.1.

Previous releases

  1. Version 0.6: The first public release of SkePU. Available for download via the following link: download source-code. It contains seven data-parallel skeletons for vector operands with multi-GPU support. See the HTML documentation, generated with Doxygen.
  2. Version 0.7: Available for download via the following link: download source-code. It contains seven data-parallel skeletons for vector and (partial) matrix operands, with efficient multi-GPU support on CUDA using a single host thread. See the HTML documentation, generated with Doxygen.
  3. Version 0.7 with StarPU integration: SkePU integrated with StarPU can be downloaded via the following link: download source-code. It contains seven data-parallel and one task-parallel (farm) skeletons for vector and matrix operands, with multi-GPU and hybrid execution support. See the HTML documentation, generated with Doxygen.

Publications


A code example

SkePU is a C++ template library that provides a higher level of abstraction to the application programmer. SkePU code is concise, elegant, and efficient. In the following, the dot product of two input vectors is computed using the MapReduce skeleton available in the SkePU library. In the code, a MapReduce skeleton (dotProduct) is created which maps the two input vectors with mult_f and then reduces the result with plus_f, producing the dot product. Behind the scenes, the computation can run on a sequential CPU, multicore CPUs, or single or multiple GPUs, depending on the execution configuration and the platform used.
#include <iostream>

#include "skepu/vector.h"
#include "skepu/mapreduce.h"

BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

BINARY_FUNC(mult_f, double, a, b,
    return a*b;
)

int main()
{
    skepu::MapReduce <mult_f, plus_f> dotProduct(new mult_f,
                                            new plus_f);

    skepu::Vector <double> v1(500,4);
    skepu::Vector <double> v2(500,2);

    double r = dotProduct(v1,v2);
    
    std::cout <<"Result: " << r << "\n";
    
    return 0;
}
// Output
// Result: 4000

Evaluation

Several applications have been developed using the SkePU skeletons, including a Runge-Kutta ODE solver, separable and non-separable image convolution filters, Successive Over-Relaxation (SOR), Coulombic potential, and Mandelbrot fractals. This section lists some results of tests conducted with several applications, highlighting different performance-related aspects.

Multi-GPU execution

SkePU supports multi-GPU execution using both CUDA and OpenCL. However, with CUDA before version 4.0, multi-GPU execution was rather inefficient due to the threading overhead on the host side. With CUDA 4.0, it is possible to use all GPUs in the system concurrently from a single host (CPU) thread, which makes multi-GPU execution using CUDA a viable option for many applications. SkePU's multi-GPU support has accordingly been changed to use a single host thread with CUDA 4.0. The image below compares execution on one and two GPUs using the SkePU multi-GPU execution with CUDA 4.0.

Hybrid execution

When SkePU is used with StarPU support, the work of one or more skeleton executions can be split across multiple computing devices (CPUs and GPUs) in an efficient way. The image below shows Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes.

Hybrid execution: Coulombic potential grid execution on a hybrid platform (CPUs and GPU) for different matrix sizes

Performance-aware dynamic scheduling

SkePU supports dynamic performance-aware scheduling when used with StarPU. The table below shows execution of iterative SOR with such performance-aware scheduling policies, where average speedups are calculated over different matrix sizes with respect to the static scheduling (CUDA execution) policy. The results show that scheduling policies with data-prefetching support yield significant performance gains.

Data-aware scheduling policy

GPU communication overhead

The image below shows the time distribution of different operations (e.g., memory transfers, computation) using OpenCL. It highlights the overhead of transferring data to and from GPUs. To overcome this data-transfer problem, SkePU uses a "lazy memory copying" technique (see the documentation and publications for details).

Overhead of memory transfer, in comparison to computations

Performance comparison

The image below shows the dotProduct example computed using different backends, highlighting the performance differences between them. It also includes a comparison with a CUBLAS implementation; the Multi_OpenCL MapReduce implementation of SkePU outperforms all others in this case.

DotProduct using MapReduce SkePU skeleton in comparison to cublas implementation
For more results, see the publications section.

Software License

SkePU is licensed under the GNU General Public License, as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.

Ongoing work

SkePU is a work in progress. Future work includes adding support for more skeletons, for dense and possibly sparse matrix operations, as well as more task-parallel skeletons.
Contact: Usman Dastgeer, Johan Enmyren, Prof. Christoph Kessler. To report bugs, please email "<firstname> DOT <lastname> AT liu DOT se".
This work was partly funded by EU FP7 project PEPPHER.