
SkePU

Autotunable Multi-Backend Skeleton Programming Framework for Multicore CPU and Multi-GPU Systems

New: Preview release of SkePU 2, with a flexible and type-safe C++11-based API and a new compilation model

Introduction

Skeleton programming is an approach where an application is written with the help of skeletons. A skeleton is a pre-defined, generic component such as map, reduce, scan, farm, pipeline etc. that implements a specific, commonly recurring pattern of computation and data dependence, and that can be customized with (sequential) user-defined code parameters. Skeletons provide a high degree of abstraction and portability with a quasi-sequential programming interface, as their implementations encapsulate all low-level and platform-specific details such as parallelization, synchronization, communication, memory management, accelerator usage and other optimizations.

SkePU is an open-source skeleton programming framework for multicore CPUs and multi-GPU systems. It is a C++ template library with six data-parallel and one task-parallel skeletons, two generic container types, and support for execution on multi-GPU systems both with CUDA and OpenCL.

Main features

Supported skeletons

  • Map: Data-parallel element-wise application of a function with arbitrary arity.
  • Reduce: Reduction with 1D and 2D variations.
  • MapReduce: Efficient nesting of Map and Reduce.
  • MapOverlap: Stencil operation in 1D and 2D.
  • Scan: Generalized prefix sum.
  • Call: Generic multi-variant component.

Multiple backends and multi-GPU support

Each skeleton has several different backends (implementations): sequential C, OpenMP, OpenCL, CUDA, and multi-GPU OpenCL and CUDA. There is also an experimental backend for MPI (not part of the public distribution below).

Tunable context-aware implementation selection

Depending on the properties of a skeleton call's execution context (in particular, operand sizes), different backends (skeleton implementations) targeting different execution unit types will be more or less efficient. SkePU provides a tuning framework that automatically selects the expected fastest implementation variant at each skeleton call.
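
When automatic selection is not desired, a specific backend can also be requested for a skeleton instance. The sketch below illustrates this; note that the BackendSpec type, the Backend::Type enumeration, and the setBackend call are assumptions about the SkePU 2 preview API, not taken from this page.

#include <iostream>
#include <skepu2.hpp>

template<typename T>
T plus(T a, T b)
{
	return a + b;
}

int main()
{
	auto sum = skepu2::Reduce(plus<float>);

	// Hypothetical explicit backend selection (BackendSpec, Backend::Type and
	// setBackend are assumed names): force OpenMP execution instead of letting
	// the tuning framework pick the expected fastest variant.
	skepu2::BackendSpec spec(skepu2::Backend::Type::OpenMP);
	sum.setBackend(spec);

	skepu2::Vector<float> v(1000);
	v.randomize(0, 10);

	std::cout << "sum: " << sum(v) << "\n";
	return 0;
}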

Smart containers (Vector, Matrix)

Smart containers are runtime data structures that wrap operand data passed to skeleton calls. They control software caching of operand elements by automatically keeping track of the valid copies of their element data and of the memory locations of those copies. At runtime, smart containers can automatically optimize communication, perform memory management, and synchronize asynchronous skeleton calls driven by operand data flow.
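
As an illustration of why this matters, the sketch below (using only constructs that appear in the example further down this page) chains two skeleton calls on the same data. The smart Vector tracks where valid copies of its elements reside, so the intermediate result can stay on the device between the calls instead of being copied back to the host.

#include <skepu2.hpp>

float square(float a)
{
	return a * a;
}

template<typename T>
T plus(T a, T b)
{
	return a + b;
}

// Sum of squares, expressed as two chained skeleton calls.
float sum_of_squares(skepu2::Vector<float> &v)
{
	auto sq  = skepu2::Map<1>(square);      // element-wise squaring
	auto sum = skepu2::Reduce(plus<float>); // associative reduction

	skepu2::Vector<float> squared(v.size());
	sq(squared, v);      // may run on a GPU; result remains device-resident
	return sum(squared); // reuses the copy already on the device
}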

Changes

SkePU 2 is a rewrite of SkePU, with a new user interface based on modern C++ syntax, such as lambda expressions; a new compilation model with a source-to-source translator using Clang libraries; a revised implementation and runtime built around variadic templates; and the same efficient parallel skeleton algorithms as before. The design of SkePU 2 emphasizes flexibility and type-safety, areas in which earlier work on SkePU 1.x had identified opportunities for improvement.

SkePU 2 removes the MapArray and Generate skeletons in favor of a generalized Map, and adds the Call skeleton. It is not backwards compatible with programs written for SkePU 1.x. However, translating existing source code is straightforward, as each skeleton in SkePU 1.x has a direct (or simpler) equivalent in SkePU 2.

More information on older SkePU versions is available here.

Downloads

  • Source code
  • User guide
  • Presentation

(Previous versions)

Software License

SkePU is licensed under the GNU General Public License as published by the Free Software Foundation (version 3 or later). For more information, please see the license file included in the downloadable source code.

System Requirements

SkePU requires a target compiler compatible with the C++11 standard (or later). Additionally, a system able to both build and run LLVM, including Clang, is required (this need not be the target system).

SkePU is tested with GCC 4.9+ on Linux and macOS, and with Apple LLVM (Clang) 8.0 on macOS.

The following optional features require additional compiler, runtime, or hardware support:

  • The CUDA backend requires the CUDA compiler and runtime (version 7 or later), as well as CUDA-capable hardware.
  • The OpenCL backend requires an OpenCL runtime and OpenCL-capable hardware.
  • The OpenMP backend requires an OpenMP-enabled compiler (e.g., GCC).

Example

This example demonstrates SkePU with a program computing the Pearson product-moment correlation coefficient (PPMCC) of two vectors. It uses three skeleton instances, sum, sumSquare, and dotProduct, and illustrates multiple ways of defining user functions. SkePU's smart containers are well suited to calling multiple skeletons in sequence, as done here.

#include <iostream>
#include <cmath>
#include <skepu2.hpp>

// Unary user function
float square(float a)
{
	return a * a;
}

// Binary user function
float mult(float a, float b)
{
	return a * b;
}

// User function template
template<typename T>
T plus(T a, T b)
{
	return a + b;
}

// Function computing PPMCC
float ppmcc(skepu2::Vector<float> &x, skepu2::Vector<float> &y)
{
	// Instance of Reduce skeleton 
	auto sum = skepu2::Reduce(plus<float>);
	
	// Instance of MapReduce skeleton
	auto sumSquare = skepu2::MapReduce<1>(square, plus<float>);
	
	// Instance with lambda syntax
	auto dotProduct = skepu2::MapReduce<2>(
		[] (float a, float b) { return a * b; },
		[] (float a, float b) { return a + b; }
	);
	
	size_t N = x.size();
	float sumX = sum(x);
	float sumY = sum(y);
	
	return (N * dotProduct(x, y) - sumX * sumY)
		/ std::sqrt((N * sumSquare(x) - std::pow(sumX, 2)) * (N * sumSquare(y) - std::pow(sumY, 2)));
}

int main()
{
	const size_t size = 100;
	
	// Vector operands
	skepu2::Vector<float> x(size), y(size);
	x.randomize(1, 3);
	y.randomize(2, 4);
	
	std::cout << "X: " << x << "\n";
	std::cout << "Y: " << y << "\n";
	
	float res = ppmcc(x, y);
	
	std::cout << "res: " << res << "\n";
	
	return 0;
}

Skeletons

SkePU contains the skeletons Map, Reduce, MapReduce, MapOverlap, Scan, and Call. They are presented separately below, with a short program snippet exemplifying their usage. In practice, repeatedly calling skeletons and using multiple types of skeletons is encouraged, to fully utilize the advantages of SkePU such as the smart containers.

Map

The Map skeleton represents a flat, data-parallel computation without dependencies. Each sub-computation is an application of the user function supplied to the Map. Map operates on SkePU containers, applying the user function (sum in the example below) once per element. Any number of equally sized containers can be supplied as arguments; Map will pass one element from each. For more general computations, extra arguments (containers or scalar values) can also be supplied as needed and accessed freely inside the user function.

Map can also be used without element-wise arguments, with the number of sub-computations controlled solely by the size of the output container. Programs can take advantage of this, and of the fact that the user function can be made aware of the index of the currently processed element, to create interesting use cases; see, for example, the Mandelbrot program in the distribution, or the sketch after the example below.

float sum(float a, float b)
{
	return a + b;
}

Vector<float> vector_sum(Vector<float> &v1, Vector<float> &v2)
{
	auto vsum = Map<2>(sum);
	Vector<float> result(v1.size());
	return vsum(result, v1, v2);
}
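
As mentioned above, Map can also run with zero element-wise arguments and an index-aware user function. The sketch below assumes the Index1D parameter convention of the SkePU 2 preview (an index struct with field i, passed as the first user-function parameter); treat the exact names as assumptions.

// Index-aware user function: computes element i as i * step.
float ramp(Index1D index, float step)
{
	return index.i * step;
}

Vector<float> make_ramp(size_t n, float step)
{
	auto fill = Map<0>(ramp);  // zero element-wise arity
	Vector<float> result(n);
	return fill(result, step); // number of sub-computations = result.size()
}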

Reduce

Reduce performs an associative reduction. Two modes are available: 1D reduction and 2D reduction. An instance of the former type accepts a vector or a matrix, producing a scalar or a vector respectively as the result; the latter only works on matrices. For matrix reductions, the primary direction of the reduction (row-first or column-first) can be controlled.

float min_f(float a, float b)
{
	return (a < b) ? a : b;
}

float min_element(Vector<float> &v)
{
	auto min_calc = Reduce(min_f);
	return min_calc(v);
}

MapReduce

MapReduce is a sequence of Map and Reduce and offers the features of both, with some limitations; for example, only 1D reductions are supported. An instance is created from two user functions, one for mapping and one for reducing. The reduce function should be associative.

As with Map, zero element-wise arity and element-index awareness in the user function open up interesting possibilities. The Taylor program in the distribution uses this to perform Taylor series approximations with zero data-movement overhead; a sketch of that style follows the dot-product example below. The example itself shows a more typical dot product computation.

float add(float a, float b)
{
	return a + b;
}

float mult(float a, float b)
{
	return a * b;
}

float dot_product(Vector<float> &v1, Vector<float> &v2)
{
	auto dotprod = MapReduce<2>(mult, add);
	return dotprod(v1, v2);
}
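
The following sketch of the index-aware, zero-arity style approximates ln(1 + x) by summing the first terms of its Taylor series. The Index1D parameter and the setDefaultSize call (used here to set the number of sub-computations when there is no element-wise operand) are assumptions about the preview API.

// Term k of the series ln(1 + x) = x - x^2/2 + x^3/3 - ...
float term(Index1D index, float x)
{
	int k = index.i + 1;
	float sign = (k % 2 == 0) ? -1.0f : 1.0f;
	return sign * pow(x, k) / k;
}

float ln_taylor(float x, size_t terms)
{
	auto taylor = MapReduce<0>(term, add); // 'add' as defined above
	taylor.setDefaultSize(terms);          // assumed call: sets the number of sub-computations
	return taylor(x);
}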

Scan

Scan performs a generalized prefix sum operation, either inclusive or exclusive.

float max_f(float a, float b)
{
	return (a > b) ? a : b;
}

Vector<float> partial_max(Vector<float> &v)
{
	auto premax = Scan(max_f);
	Vector<float> result(v.size());
	return premax(result, v);
}
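
Presumably the scan mode can be chosen per instance; the sketch below assumes setScanMode/setStartValue-style calls and a ScanMode enumeration, which are hypothetical names not confirmed by this page.

Vector<float> exclusive_partial_max(Vector<float> &v)
{
	auto premax = Scan(max_f);
	// Hypothetical interface for selecting an exclusive scan with an explicit start value.
	premax.setScanMode(ScanMode::Exclusive);
	premax.setStartValue(0.0f);
	Vector<float> result(v.size());
	return premax(result, v);
}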

MapOverlap

MapOverlap represents a stencil operation. It is similar to Map, but instead of a single element, a region of elements is available in the user function. The region is passed as a pointer to the center element, so manual pointer arithmetic is required to access the data.

A MapOverlap can either be one-dimensional, working on vectors or matrices (separable computations only), or two-dimensional, working on matrices. The dimensionality is set per instance and deduced from the user function. Different edge handling modes are available, such as overlapping or clamping.

// 1D convolution user function: v points at the center element of the overlap
// region, so neighbours are reached at offsets -overlap*stride ... +overlap*stride.
float conv(int overlap, size_t stride, const float *v, const Vec<float> stencil, float scale)
{
	float res = 0;
	for (int i = -overlap; i <= overlap; ++i)
		res += stencil[i + overlap] * v[i*stride];
	return res / scale;
}

Vector<float> convolution(Vector<float> &v)
{
	auto convol = MapOverlap(conv);
	Vector<float> stencil {1, 2, 4, 2, 1};
	Vector<float> result(v.size());
	convol.setOverlap(2); // overlap radius 2 matches the 5-element stencil
	return convol(result, v, stencil, 10);
}

Call

Details coming soon

Applications

Several test applications have been developed using the SkePU skeletons, for example the Mandelbrot and Taylor series programs mentioned above. The source code for some of these is included in the SkePU distribution.

Publications


2010

  • Johan Enmyren and Christoph W. Kessler.
    SkePU: A multi-backend skeleton programming library for multi-GPU systems.
    In Proc. 4th Int. Workshop on High-Level Parallel Programming and Applications (HLPP-2010), Baltimore, Maryland, USA. ACM, September 2010 (PDF)
  • Johan Enmyren, Usman Dastgeer and Christoph Kessler.
    Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems.
    Proc. MCC-2010 Third Swedish Workshop on Multicore Computing, Gothenburg, Sweden, Nov. 2010.

2011

  • Usman Dastgeer, Johan Enmyren, and Christoph Kessler.
    Auto-tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems.
    Proc. IWMSE-2011, Hawaii, USA, May 2011, ACM (ACM DL)
    A previous version of this article was also presented at: Proc. Fourth Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG-2011), January 23, 2011, in conjunction with HiPEAC-2011 conference, Heraklion, Greece.
  • Usman Dastgeer and Christoph Kessler.
    Flexible Runtime Support for Efficient Skeleton Programming on Heterogeneous GPU-based Systems
    Proc. ParCo 2011: International Conference on Parallel Computing, Ghent, Belgium, 2011. (PDF)
  • Usman Dastgeer.
    Skeleton Programming for Heterogeneous GPU-based Systems.
    Licentiate thesis. Thesis No. 1504, Department of Computer and Information Science, Linköping University, October, 2011 (LIU EP)

2012

  • Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards, Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Träff and Sabri Pllana.
    Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems.
    Proc. DATE-2012 conference on Design, Automation and Test in Europe, Dresden, March 2012. (PDF (author version), PDF at IEEE Xplore)
  • Usman Dastgeer and Christoph Kessler.
    A performance-portable generic component for 2D convolution computations on GPU-based systems.
    Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012. (PDF)

2013

  • Usman Dastgeer, Lu Li, Christoph Kessler.
    Adaptive implementation selection in the SkePU skeleton programming library.
    Proc. 2013 Biennial Conference on Advanced Parallel Processing Technology (APPT-2013), Stockholm, Sweden, Aug. 2013. (PDF)
  • Mudassar Majeed, Usman Dastgeer, Christoph Kessler.
    Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters.
    Proc. Int. Conf. on Parallel and Distr. Processing Techniques and Applications (PDPTA-2013), Las Vegas, USA, July 2013. (PDF)

2014

  • Usman Dastgeer.
    Performance-Aware Component Composition for GPU-based Systems.
    PhD thesis, Linköping Studies in Science and Technology, Dissertation No. 1581, Linköping University, May 2014. (LiU-EP)
  • Christoph Kessler, Usman Dastgeer and Lu Li.
    Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers.
    In: F. Hannig and J. Teich (eds.), Proc. First Workshop on Resource awareness and adaptivity in multi-core computing (Racing 2014), May 2014, Paderborn, Germany, pp. 43-48. (PDF)
  • Usman Dastgeer and Christoph Kessler.
    Smart Containers and Skeleton Programming for GPU-based Systems.
    Proc. of the 7th Int. Symposium on High-level Parallel Programming and Applications (HLPP'14), Amsterdam, July 2014. (PDF slides)

2015

  • Usman Dastgeer and Christoph Kessler.
    Smart Containers and Skeleton Programming for GPU-based Systems.
    Int. Journal of Parallel Programming 44(3):506-530, June 2016 (online: March 2015), Springer. DOI: 10.1007/s10766-015-0357-6.

2016

  • Oskar Sjöström, Soon-Heum Ko, Usman Dastgeer, Lu Li, Christoph Kessler:
    Portable Parallelization of the EDGE CFD Application for GPU-based Systems using the SkePU Skeleton Programming Library.
    Proc. ParCo-2015 conference, Edinburgh, UK, 1-4 Sep. 2015.
    Published in: Gerhard R. Joubert, Hugh Leather, Mark Parsons, Frans Peters, Mark Sawyer (eds.): Advances in Parallel Computing, Volume 27: Parallel Computing: On the Road to Exascale, IOS Press, April 2016, pages 135-144. DOI 10.3233/978-1-61499-621-7-135.
  • Sebastian Thorarensen, Rosandra Cuello, Christoph Kessler, Lu Li and Brendan Barry:
    Efficient Execution of SkePU Skeleton Programs on the Low-power Multicore Processor Myriad2.
    Proc. 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016), pages 398-402, IEEE, Feb. 2016. DOI: 10.1109/PDP.2016.123.
  • August Ernstsson, Lu Li, Christoph Kessler:
    SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems.
    Accepted for HLPP-2016, Münster, Germany, 4-5 July 2016.

In press

  • Christoph W. Kessler, Sergei Gorlatch, Johan Enmyren, Usman Dastgeer, Michel Steuwer, Philipp Kegel.
    Skeleton Programming for Portable Many-Core Computing.
    Book Chapter, 20 pages, in: S. Pllana and F. Xhafa, eds., Programming Multi-Core and Many-Core Computing Systems, Wiley Interscience, New York, accepted 2011, to appear, 2016 (?)

Ongoing work

SkePU is a work in progress. Future work includes adding support for more skeletons and containers, e.g. for sparse matrix operations, further task-parallel skeletons, and other types of target architectures. For instance, there exists an experimental prototype with MPI backends, which allows SkePU programs to run on multiple nodes of an MPI cluster without source-code modification (not included in the public distribution above).

If you would like to contribute, please let us know.

Related projects

  • MeterPU: Generic portable measurement framework for multicore CPU and multi-GPU systems

Contact

For reporting feature suggestions, bug reports, and other comments: please email to "<firstname> DOT <lastname> AT liu DOT se".

Acknowledgments

This work was partly funded by the EU FP7 projects PEPPHER and EXCESS, by SeRC project OpCoReS, and by CUGS.

Previous major contributors to SkePU include Johan Enmyren and Usman Dastgeer. The multivector container for MapArray (obsolete in SkePU 2) and the SkePU 1.2 user guide were contributed by Oskar Sjöström.

Streaming support for CUDA in SkePU 1.2 was contributed by Claudio Parisi from the University of Pisa, Italy, in cooperation with the FastFlow project.

SkePU example programs in the public distribution have been contributed by Johan Enmyren, Usman Dastgeer, Mudassar Majeed and Lukas Gillsjö.

Experimental implementations of SkePU for other target platforms (not part of the public distribution) have been contributed e.g. by Mudassar Majeed, Rosandra Cuello and Sebastian Thorarensen.

