Skeleton programming is an approach where an application is written
with the help of "skeletons". A skeleton is a pre-defined, generic component,
such as map, reduce, scan, farm, or pipeline,
that implements a common pattern of computation and data dependence and
that can be customized with (sequential) user-defined code parameters.
Skeletons provide a high degree of abstraction and portability
with a quasi-sequential
programming interface, as their implementations encapsulate all
low-level and platform-specific details such as
parallelization, synchronization, communication, memory management,
accelerator usage and other optimizations.
SkePU is an open-source skeleton programming framework for multicore CPUs
and multi-GPU systems.
It is a C++ template library with six data-parallel
and one task-parallel skeletons, two generic container types, and
support for execution on multi-GPU systems both with CUDA and OpenCL.
NEW: SkePU User Guide, May 2015 (revised July 2015)
SkePU Poster (PDF, 1 page), previous version presented at HiPEAC-2012
Main features:
Supported skeletons: six data-parallel skeletons and one task-parallel (farm) skeleton.
Multiple back-ends and multi-GPU support: Each skeleton has several different backends (implementations): for sequential C, OpenMP, OpenCL, CUDA, and multi-GPU OpenCL and CUDA. There is also an experimental back-end for MPI (which is not part of the public distribution below).
Tunable context-aware implementation selection: Depending on a skeleton call's execution context properties (usually, operand size), different back-ends (skeleton implementations) for different execution unit types will be more efficient than others. SkePU provides a tuning framework that allows it to automatically select the expected fastest implementation variant at each skeleton call.
Smart containers (Vector, Matrix) for passing operands to skeleton calls: Smart containers are runtime data structures wrapping operand data that control software caching of operand elements by automatically keeping track of the valid copies of their element data and their memory locations. Smart containers can, at runtime, automatically optimize communication, perform memory management, and synchronize asynchronous skeleton calls driven by operand data flow.
Hybrid execution (CPU, GPU) of skeleton calls using an overpartitioning approach with dynamic heterogeneous scheduling (in the StarPU-integrated version)
SkePU comes in two different distributions:
(1) as a stand-alone version
which includes an off-line autotuning framework preparing for context-aware dynamic
selection of the expected fastest implementation variant at each skeleton call,
and
(2) as a version that is integrated with the
StarPU runtime system.
The latter version provides support for hybrid CPU+GPU execution,
performance-aware dynamic scheduling and load balancing.
Source code for SkePU 1.x is available on request
See also the html documentation generated by doxygen.
Johan Enmyren and Christoph W. Kessler.
SkePU: A multi-backend skeleton programming library for multi-GPU systems.
In Proc. 4th Int. Workshop on High-Level Parallel Programming and Applications (HLPP-2010), Baltimore, Maryland, USA. ACM, September 2010.
(PDF)
Johan Enmyren, Usman Dastgeer and Christoph Kessler.
Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems.
Proc. MCC-2010 Third Swedish Workshop on Multicore Computing, Gothenburg, Sweden, Nov. 2010.
Usman Dastgeer and Christoph Kessler.
Flexible Runtime Support for Efficient Skeleton Programming
on Heterogeneous GPU-based Systems.
Proc. ParCo 2011: International Conference on Parallel Computing,
Ghent, Belgium, 2011.
(PDF)
Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards,
Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Träff and Sabri Pllana.
Programmability and Performance Portability Aspects of Heterogeneous
Multi-/Manycore Systems.
Proc. DATE-2012 conference on Design Automation and Testing in Europe,
Dresden, March 2012.
(PDF (author version),
PDF at IEEE Xplore)
Usman Dastgeer and Christoph Kessler.
A performance-portable generic component for 2D convolution computations
on GPU-based systems.
Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012.
(PDF)
Usman Dastgeer, Lu Li, Christoph Kessler.
Adaptive implementation selection in the SkePU skeleton programming library.
Proc. 2013 Biennial Conference on Advanced Parallel Processing Technology
(APPT-2013), Stockholm, Sweden, Aug. 2013.
(PDF)
Mudassar Majeed, Usman Dastgeer, Christoph Kessler.
Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters.
Proc. Int. Conf. on Parallel and Distr. Processing Techniques and Applications (PDPTA-2013), Las Vegas, USA, July 2013.
(PDF)
Usman Dastgeer.
Performance-Aware Component Composition for GPU-based Systems.
PhD thesis, Linköping Studies in Science and Technology, Dissertation
No. 1581, Linköping University, May 2014.
(LiU-EP)
Christoph Kessler, Usman Dastgeer and Lu Li.
Optimized Composition: Generating Efficient Code for Heterogeneous Systems
from Multi-Variant Components, Skeletons and Containers.
In: F. Hannig and J. Teich (eds.), Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany, pp. 43-48. (PDF)
Usman Dastgeer and Christoph Kessler.
Smart Containers and Skeleton Programming for GPU-based Systems.
Proc. of the 7th Int. Symposium on High-level Parallel Programming and Applications (HLPP'14), Amsterdam, July 2014. (PDF slides)
Usman Dastgeer and Christoph Kessler.
Smart Containers and Skeleton Programming for GPU-based Systems.
Int. Journal of Parallel Programming
44(3):506-530, June 2016 (online: March 2015), Springer.
DOI: 10.1007/s10766-015-0357-6.
Oskar Sjöström, Soon-Heum Ko, Usman Dastgeer,
Lu Li, Christoph Kessler:
Portable Parallelization of the EDGE CFD Application for GPU-based Systems using the SkePU Skeleton Programming Library.
Proc. ParCo-2015 conference, Edinburgh, UK, 1-4 Sep. 2015.
Published in: Gerhard R. Joubert, Hugh Leather, Mark Parsons, Frans Peters, Mark Sawyer (eds.):
Advances in Parallel Computing, Volume 27: Parallel Computing: On the Road to Exascale,
IOS Press, April 2016, pages 135-144.
DOI 10.3233/978-1-61499-621-7-135.
Sebastian Thorarensen, Rosandra Cuello, Christoph Kessler,
Lu Li and Brendan Barry:
Efficient Execution of SkePU Skeleton Programs on the
Low-power Multicore Processor Myriad2.
Proc. 24th Euromicro International Conference on Parallel, Distributed,
and Network-Based Processing (PDP 2016),
pages 398-402, IEEE, Feb. 2016.
DOI: 10.1109/PDP.2016.123.
August Ernstsson, Lu Li, Christoph Kessler:
SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems.
Accepted for HLPP-2016, Münster, Germany, 4-5 July 2016.
Christoph W. Kessler, Sergei Gorlatch, Johan Enmyren, Usman Dastgeer, Michel Steuwer, Philipp Kegel.
Skeleton Programming for Portable Many-Core Computing.
Book Chapter, 20 pages, in: S. Pllana and F. Xhafa, eds.,
Programming Multi-Core and Many-Core Computing Systems,
Wiley Interscience, New York, accepted 2011, to appear.
SkePU is a C++ template library targeting GPU-based systems that provides a higher abstraction level to the application programmer. SkePU code is concise, elegant and efficient. The following example computes the dot product of two input vectors using the MapReduce skeleton available in the SkePU library. In the code, a MapReduce skeleton instance (dotProduct) is created that maps two input vectors with mult_f and then reduces the result with plus_f, thus instantiating a dot product function.
Behind the scenes, the computation can run on a sequential CPU,
on multiple CPU cores, or on one or several GPUs, depending on the execution
configuration and the platform used for the execution.
#include <iostream>

#include "skepu/vector.h"
#include "skepu/mapreduce.h"

BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

BINARY_FUNC(mult_f, double, a, b,
    return a*b;
)

int main()
{
    skepu::MapReduce<mult_f, plus_f> dotProduct(new mult_f, new plus_f);

    skepu::Vector<double> v1(500, 4);
    skepu::Vector<double> v2(500, 2);

    double r = dotProduct(v1, v2);

    std::cout << "Result: " << r << "\n";

    return 0;
}

// Output
// Result: 4000
Several test applications have been developed using the SkePU skeletons.
This section lists some results of tests conducted with several applications, highlighting different performance-related aspects. For a comprehensive description and evaluation of SkePU features and implementation, see e.g. Chapters 3 and 4 of Usman Dastgeer's PhD thesis.
SkePU supports multi-GPU executions using CUDA and OpenCL,
even with early CUDA versions.
With CUDA before version 4.0, the multi-GPU execution was
rather inefficient due to the threading overhead on the host side.
With CUDA 4.0 and later, it is possible to use all GPUs in the system concurrently
from a single host (CPU) thread. This makes multi-GPU executions using CUDA
a viable option for many applications.
SkePU's multi-GPU support was accordingly changed to use a single host
thread when compiled with CUDA 4.0 or later. The diagram below compares execution on one
and two GPUs with SkePU multi-GPU execution using CUDA 4.0.
The stand-alone version of SkePU applies an off-line (deployment-time) machine learning technique to predict at runtime the fastest back-end for a skeleton call, depending on the call context (in particular, operand sizes). This includes the choice between CPU implementations (sequential or with multiple OpenMP threads) and between CUDA and OpenCL with one or several GPUs. Even the number of GPU threads and thread blocks can be tuned. See the documentation and publications for further details.
The StarPU-integrated version of SkePU instead delegates the back-end selection to the StarPU runtime system's built-in dynamic performance modeling and tuning mechanism.
When using SkePU with StarPU support, SkePU can split the work of one or more skeleton executions into multiple tasks and use multiple computing devices (CPUs/GPUs) in an efficient way. The diagram below shows a Coulombic potential grid application execution on a hybrid platform (CPUs and GPU) for different matrix sizes.
The diagram to the right shows a breakdown of the execution time for different SkePU single-skeleton computations coded in OpenCL, showing PCIe memory transfer and kernel computation times. It highlights the overhead for transferring data to (red) and from (green) GPUs in relation to the kernel's computational work (blue).
In order to eliminate unnecessary data transfers across multiple subsequent skeleton calls, SkePU provides "smart containers" for passing operands in generic STL-like data structures such as Vector and Matrix. A smart container internally performs software caching of recently accessed elements in the various device memories and reuses them in subsequent calls on the same device where applicable, resulting in a run-time optimization of operand communication. In particular, it implements a "lazy memory copying" technique that transfers written elements back from device memory only if they are accessed by the CPU after the call.
The smart container concept and implementation have been revised in SkePU v1.1 (2014) compared to earlier versions. The Vector and Matrix smart containers in SkePU v1.1 internally implement a variant of the MSI coherence protocol, providing sequential consistency (so device copies no longer need to be explicitly flush()ed in a multi-GPU scenario). For multi-GPU usage, they also support direct GPU-GPU transfer of coherence messages where available, and use lazy deallocation of device copies to reduce memory management overhead.
Smart containers can lead to a significant performance gain over "normal" containers, especially for iterative computations such as Nbody simulation or SPH where we observed speedups of up to 3 orders of magnitude by using smart containers instead of naive operand data transfer before and after each kernel invocation. Some speedup results for a system with 2 GPUs, averaged over many runs with different problem sizes, are shown below to the right. For the details, see the documentation and publications.
The diagram below shows the performance of the dotProduct example computed using different backends, highlighting performance differences between the back-ends. It also includes a comparison with CUBLAS; in this case, the Multi_OpenCL MapReduce implementation of SkePU outperforms all the others.
For more results, see the publications section.
SkePU is licensed under the GNU General Public License as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.
SkePU is a work in progress. Future work includes adding support for more skeletons and containers, e.g. for sparse matrix operations; for further task-parallel skeletons; and for other types of target architectures. For instance, there exists an experimental prototype with MPI backends, which allows SkePU programs to run on multiple nodes of an MPI cluster without source-code modification (not included in the above public distribution).
If you would like to contribute, please let us know.
For reporting bugs, please email to "<firstname> DOT <lastname> AT liu DOT se".
This work was partly funded by the EU FP7 projects PEPPHER and EXCESS, and by SeRC project OpCoReS.
Previous major contributors to SkePU include
Johan Enmyren and Usman Dastgeer.
The multivector container for MapArray and the user guide were added by
Oskar Sjöström.
Streaming support for CUDA in SkePU 1.2
has been contributed by Claudio Parisi from University of Pisa, Italy,
in a recent cooperation with the FastFlow project.
SkePU example programs in the public distribution
have been contributed by Johan Enmyren, Usman Dastgeer,
Mudassar Majeed and Lukas Gillsjö.
Experimental implementations of SkePU for other target platforms
(not part of the public distribution) have been
contributed e.g. by Mudassar Majeed, Rosandra Cuello and Sebastian Thorarensen.