Skeleton programming is an approach where an application is written with the help of "skeletons". A skeleton models common computation and coordination patterns of parallel applications as a pre-defined generic component that can be parameterized with user computations.
SkePU is such a skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel and one task-parallel skeleton, two container types, and support for execution on multi-GPU systems with both CUDA and OpenCL. Recently, support for hybrid execution, performance-aware dynamic scheduling and load balancing has been added to SkePU by implementing a backend for the StarPU runtime system.
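As a small illustration of the idea, the sketch below shows a Map skeleton parameterized with a user function. It is a minimal sketch only: it assumes SkePU 1.x's skepu/map.h header, the UNARY_FUNC macro and an (input, output) call signature for the Map instance; the dot-product example further down this page shows the library API exactly as released.

#include <iostream>
#include "skepu/vector.h"
#include "skepu/map.h"

// User computation: square each element (assumed UNARY_FUNC macro, SkePU 1.x style).
UNARY_FUNC(square_f, double, a,
    return a*a;
)

int main()
{
    // Generic Map skeleton parameterized with the user function square_f.
    skepu::Map<square_f> square(new square_f);

    skepu::Vector<double> v(10, 3);   // ten elements, all set to 3
    skepu::Vector<double> r(10);

    square(v, r);                     // r[i] = v[i]*v[i], executed on the selected backend

    std::cout << "r[0] = " << r[0] << "\n";   // prints 9
    return 0;
}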
Source code and documentation
Current release
- Original SkePU: Version 1.0 (last updated: 04/08/2012): available for download via the following link: download source code. It contains seven data-parallel skeletons for vector, dense and sparse matrix operands, with efficient multi-GPU support on CUDA and OpenCL using a single host thread. Contains several enhancements compared to the previous release; see the 'CHANGES' file for more information. HTML documentation, generated with Doxygen.
- SkePU with StarPU integration: Version 0.8 (last updated: 06/11/2012): available for download via the following link: download source code. It contains seven data-parallel and one task-parallel (farm) skeleton for vector and matrix operands, with multi-GPU and hybrid execution support. Contains several enhancements compared to the previous release; see the 'CHANGES' file for more information. HTML documentation, generated with Doxygen. Tested with StarPU 1.0.4, CUDA/nvcc 4.2, and GCC 4.7.1.
Previous releases
- Version 0.6: The first public release of SkePU. Available for download via the following link: download source code. It contains seven data-parallel skeletons for vector operands with multi-GPU support. See the HTML documentation, generated with Doxygen.
- Version 0.7: Available for download via the following link: download source code. It contains seven data-parallel skeletons for vector and (partial) matrix operands, with efficient multi-GPU support on CUDA using a single host thread. See the HTML documentation, generated with Doxygen.
- Version 0.7 with StarPU integration: SkePU integrated with StarPU can be downloaded via the following link: download source code. It contains seven data-parallel and one task-parallel (farm) skeleton for vector and matrix operands, with multi-GPU and hybrid execution support. See the HTML documentation, generated with Doxygen.
Publications
2010
- Johan Enmyren and Christoph W. Kessler. SkePU: A multi-backend skeleton programming library for multi-GPU systems. In Proc. 4th Int. Workshop on High-Level Parallel Programming and Applications (HLPP-2010), Baltimore, Maryland, USA. ACM, September 2010 (pdf)
- Johan Enmyren, Usman Dastgeer and Christoph Kessler.
Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems.
Proc. MCC-2010 Third Swedish Workshop on Multicore Computing, Gothenburg, Sweden, Nov. 2010.
2011
- Usman Dastgeer, Johan Enmyren, and Christoph Kessler.
Auto-tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems.
Proc. IWMSE-2011, Hawaii, USA, May 2011, ACM (pdf).
A previous version of this article was also presented at:
Proc. Fourth Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2011), January 23, 2011, in conjunction with HiPEAC-2011 conference, Heraklion, Greece.
- Usman Dastgeer and Christoph Kessler. Flexible Runtime Support for Efficient Skeleton Programming on Heterogeneous GPU-based Systems. Proc. ParCo 2011: International Conference on Parallel Computing, Ghent, Belgium, 2011.
- Christoph W. Kessler, Sergei Gorlatch, Johan Enmyren, Usman Dastgeer, Michel Steuwer, Philipp Kegel. Skeleton Programming for Portable Many-Core Computing. Book chapter, 20 pages, in: S. Pllana and F. Xhafa, eds., Programming Multi-Core and Many-Core Computing Systems, Wiley Interscience, New York, to appear in autumn 2011.
- Usman Dastgeer. Skeleton Programming for Heterogeneous GPU-based Systems. Licentiate thesis, Thesis No. 1504, Department of Computer and Information Science, Linköping University, October 2011 (LIU EP).
2012
- Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards, Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Träff and Sabri Pllana. Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems. Proc. DATE-2012 Conference on Design, Automation and Test in Europe, Dresden, March 2012.
- Usman Dastgeer and Christoph Kessler. A performance-portable generic component for 2D convolution computations on GPU-based systems. Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012.
SkePU is a C++ template library that provides a higher level of abstraction to the application programmer. SkePU code is concise, elegant and efficient. In the following, the dot product of two input vectors is computed using the MapReduce skeleton available in the SkePU library. In the code, a MapReduce skeleton instance (dotProduct) is created which maps the two input vectors with mult_f and then reduces the result with plus_f, producing the dot product. Behind the scenes, the computation can run on a sequential CPU, multi-core CPUs, or a single or multiple GPUs, depending on the execution configuration and the platform used for the execution.
#include <iostream>
#include "skepu/vector.h"
#include "skepu/mapreduce.h"

// User functions, generated from the SkePU macros.
BINARY_FUNC(plus_f, double, a, b,
    return a+b;
)

BINARY_FUNC(mult_f, double, a, b,
    return a*b;
)

int main()
{
    // MapReduce skeleton instance: element-wise mult_f, then reduction with plus_f.
    skepu::MapReduce<mult_f, plus_f> dotProduct(new mult_f, new plus_f);

    // Two vectors of 500 elements, initialized to 4 and 2 respectively.
    skepu::Vector<double> v1(500, 4);
    skepu::Vector<double> v2(500, 2);

    double r = dotProduct(v1, v2);

    std::cout << "Result: " << r << "\n";
    return 0;
}
// Output
// Result: 4000
Evaluation
Several applications have been developed using the SkePU skeletons, including a Runge-Kutta ODE solver, separable and non-separable image convolution filters, Successive Over-Relaxation (SOR), Coulombic potential grids, and Mandelbrot fractals. This section lists some results of tests conducted with several of these applications, highlighting different performance-related aspects.
Multi-GPU execution
SkePU supports multi-GPU execution using both CUDA and OpenCL. However, with CUDA versions before 4.0, multi-GPU execution was rather inefficient due to the threading overhead on the host side. With CUDA 4.0, it is possible to use all GPUs in the system concurrently from a single host (CPU) thread, which makes multi-GPU execution with CUDA a viable option for many applications. In SkePU, the multi-GPU support has accordingly been reworked with CUDA 4.0 to use a single host thread. The image below compares execution on one and two GPUs with SkePU's multi-GPU execution using CUDA 4.0.
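The following sketch is not SkePU's internal code; it only illustrates the CUDA 4.0 mechanism that makes single-host-thread multi-GPU execution possible: one host thread selects each device in turn with cudaSetDevice, issues asynchronous transfers and a kernel, and synchronizes at the end. The scale kernel and the even data partitioning are hypothetical.

#include <cuda_runtime.h>
#include <vector>
#include <algorithm>

__global__ void scale(double *data, int n, double factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// One host thread drives all GPUs: partition the data, then for each device
// select it with cudaSetDevice and issue asynchronous copies and a kernel.
void scaleOnAllGpus(double *h_data, int n, double factor)
{
    int devices = 0;
    cudaGetDeviceCount(&devices);
    if (devices == 0) return;
    int chunk = (n + devices - 1) / devices;
    std::vector<double*> d_bufs(devices);          // device buffers, null-initialized

    for (int d = 0; d < devices; ++d)
    {
        int offset = d * chunk;
        int len = std::min(chunk, n - offset);
        if (len <= 0) break;
        cudaSetDevice(d);                          // switch device; no extra host thread needed
        cudaMalloc((void**)&d_bufs[d], len * sizeof(double));
        cudaMemcpyAsync(d_bufs[d], h_data + offset, len * sizeof(double),
                        cudaMemcpyHostToDevice);
        scale<<<(len + 255) / 256, 256>>>(d_bufs[d], len, factor);
        cudaMemcpyAsync(h_data + offset, d_bufs[d], len * sizeof(double),
                        cudaMemcpyDeviceToHost);
    }
    for (int d = 0; d < devices; ++d)              // wait for all devices to finish
    {
        if (!d_bufs[d]) continue;
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(d_bufs[d]);
    }
}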
Hybrid execution
When SkePU is used with the StarPU support, the work of one or more skeleton executions can be split so that multiple computing devices (CPUs and GPUs) are used in an efficient way. The image below shows Coulombic potential grid execution on a hybrid platform (CPUs and a GPU) for different matrix sizes.
Performance-aware dynamic scheduling
SkePU supports dynamic performance-aware scheduling when used with the StarPU support. The table below shows execution of iterative SOR with such performance-aware scheduling policies, where average speedups over different matrix sizes are calculated with respect to the static scheduling (CUDA execution) policy. The results show that scheduling policies with data-prefetching support yield significant performance gains.
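For readers unfamiliar with StarPU's scheduling policies, the minimal sketch below shows how a data-aware policy with prefetching (e.g. "dmda") can be selected through StarPU's own C API. SkePU's StarPU backend normally performs the equivalent initialization itself, and StarPU also honours the STARPU_SCHED environment variable, so this is only an illustration of where the policy setting lives.

#include <starpu.h>

// Sketch: selecting a data-aware, prefetching scheduler ("dmda") directly
// through the StarPU API; shown only to illustrate the mechanism.
int init_with_dmda(void)
{
    struct starpu_conf conf;
    starpu_conf_init(&conf);
    conf.sched_policy_name = "dmda";   // deque model, data-aware: prefetches operand data
    return starpu_init(&conf);
}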
GPU communication overhead
The image below shows the time distribution for different operations (e.g. memory transfers, computation) using OpenCL. It highlights the overhead of transferring data to/from GPUs. To reduce this data-transfer overhead, SkePU uses a "lazy memory copying" technique (see the documentation and publications for details).
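The idea behind lazy memory copying can be illustrated with the simplified sketch below (this is not SkePU's actual container implementation): a container tracks whether its host and device copies are up to date and only issues a transfer when data is actually needed on the other side, so repeated skeleton calls on device-resident data avoid redundant copies.

#include <vector>
#include <cstddef>
#include <cstdio>

// Illustrative sketch only: a vector that defers host<->device transfers
// until the data is actually accessed, in the spirit of SkePU's lazy copying.
class LazyVector
{
    std::vector<double> host;
    bool deviceValid;   // is the device copy up to date?
    bool hostValid;     // is the host copy up to date?

public:
    explicit LazyVector(std::size_t n, double v = 0.0)
        : host(n, v), deviceValid(false), hostValid(true) {}

    // Called before launching a GPU kernel: copy to device only if stale.
    void prepareForDevice()
    {
        if (!deviceValid) { /* cudaMemcpy host -> device would go here */ deviceValid = true; }
        hostValid = false;   // the kernel will modify the device copy
    }

    // Called when host code reads an element: copy back only if stale.
    double read(std::size_t i)
    {
        if (!hostValid) { /* cudaMemcpy device -> host would go here */ hostValid = true; }
        return host[i];
    }
};

int main()
{
    LazyVector v(1000, 1.0);
    v.prepareForDevice();                 // first kernel: transfer happens
    v.prepareForDevice();                 // second kernel on same data: no transfer
    std::printf("%f\n", v.read(0));       // host read: a single transfer back
    return 0;
}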
Performance comparison
The image below shows the dot product example computed using different backends, highlighting the performance differences between them. It also includes a comparison with a CUBLAS implementation; the Multi_OpenCL (multi-GPU OpenCL) MapReduce implementation in SkePU outperforms all the others in this case.
For more results, see publications section.
Software License
SkePU is licensed under the GNU General Public License as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.
Ongoing work
SkePU is a work in progress. Future work includes adding support for more skeletons, e.g. for dense and possibly sparse matrix operations, as well as more task-parallel skeletons.
Contact: Usman Dastgeer, Johan Enmyren, Prof. Christoph Kessler. To report bugs, please send an email to "<firstname> DOT <lastname> AT liu DOT se".
This work was partly funded by EU FP7 project PEPPHER.