Crown Scheduling

Documentation and Examples

Overview

Publications

Code Examples

Exploiting effectively massively parallel architectures is a major challenge that stream programming can help to face. We investigate the problem of generating energy-optimal code for a collection of streaming tasks that include parallelizable or moldable tasks on a generic manycore processor with dynamic discrete frequency scaling. Streaming task collections differ from classical task sets in that all tasks are running concurrently, so that cores typically run several tasks that are scheduled round-robin at user level in a data-driven way. A stream of data flows through the tasks and intermediate results may be forwarded to other tasks like in a pipelined task graph. In this paper we consider crown scheduling, a novel technique for the combined optimization of resource allocation, mapping and discrete voltage/frequency scaling for moldable streaming task collections in order to optimize energy efficiency given a throughput constraint. We first present optimal off-line algorithms for separate and integrated crown scheduling based on integer linear programming (ILP). We make no restricting assumption about speedup behavior. We introduce the fast heuristic Longest Task, Lowest Group (LTLG) as a generalization of the Longest Processing Time (LPT) algorithm to achieve a load-balanced mapping of parallel tasks, and the Height heuristic for crown frequency scaling. We use them in feedback loop heuristics based on binary search and simulated annealing to optimize crown allocation.
Section Overview gives a brief overview of Crown schedulers features. In Section Publications is a list of publications related to Crown Scheduling You can download Drake thank to the link available in Section Download and license information is available in Section Licence. Section Installation gives the procedure to install our crown schedulers. Section Design a Drake streaming application gives code examples to design a Drake streaming application. Section Ongoing work gives our plans for future work with Crown schedulers. You can find our contact information in Section Contact. Finally, we give credits to our funders in Section Acknowledgment

Overview

Main features:

Compute schedule for energy efficiency with discrete frequency scaling under throughput constraints
Polynomial-time heuristics
Takes architectural constraints into account: arbitrary efficiency and power functions, discrete frequency scaling.
C++ framework for crown heuristics
Integrates into Mimer's workflow

Publications

Nicolas Melot, Christoph Kessler, Joerg Keller, Patrick Eitschberger: Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Many-Core Systems. ACM Transactions on Architecture and Code Optimization (TACO), http://dx.doi.org/10.1145/2687653

Download

Download source-code

See also the html documentation generated by doxygen.

Installation

Our crown scheduler package includes implementation based on AMPL and C++ implementation of heuristics. All our crown scheduler implementations require the pelib to work. Consult its documetnation page for more information to install it. Implementations basedon AMPL additionally require a working copy of a linear integer solver with a AMPL frontend (we use AMPL version 20070505 shipped with ILOG 10.100, and Gurobi 6). Define environment variables CC=gcc and CXX=g++ (or replace these values with your favorite C and C++ compilers, respectively, as long as they use the same command line switches) One all dependencies are installed, download the crown scheduler package, extract it and open a terminal in the extraction directory. Then type:

make install

to install crown schedulers. It creates directories $prefix/bin, $prefix/lib, $prefix/lib/pkgconfig and $prefix/include and install binaries, static and dynamic libraries, pkgconfig library information and include files in these directories, respectively. You can type

make install prefix=/some/other/directory

to install it elsewhere. Binaries are build with the library search path including $prefix/lib, so you may need to give a custom prefix location when linking binaries as well.

Examples

This section gives examples to illustrate the functionning of crown schedulers and its heuristics. Crown schedulers simplify the general scheduling problem by reducing possible subset of cores parallel tasks of a streaming application can be mapped to, from O(p^2) to O(p) possible mapping targets. We recursively divide the set of cores of a target execution platform into smaller, non-overlapping sets until sets include only one core; we call "group" any set of cores obtained through this recursive decomposition process. Fig. 1 illustrate this recursive decomposition and gives crown scheduling its name. Every core of a child group also belong to the same set of ancestor larger groups from which the child group is issued.

Fig. 1: The 15 processor groups in a binary crown over 8 processors, and their corresponding processors above.

In crown schedules (schedules computed by a crown scheduler), parallel tasks are mapped to non-overlapping sets of cores. All cores of a set a task in mapped to participate in running the task. Figure 2 shows two examples of schedules obtained thanks to two crown schedulers.

Fig. 2: Two schedules of the same FFT application. Time increase vertically and core 1 to p are laid out horizontally. Darker and more red tasks run at a higher frequency and brighter and more yello tasks run at a lower frequency. On the left, schedule of recursive FFT obtained from a phase-separated crown scheduler. On the right, the same application scheduled with an integrated scheduler.

Crown schedulers run 3 main phases: allocation, mapping and frequency scaling. The allocation phase consists in assigning a number of processors to run tasks to individually divide the workload of tasks. The execution time of a task is the workload of a task divided by the number of cores running it, multiplied by the parallel efficiency of the task for the number of processors assigned to the task (The efficiency function can be arbitrary). The mapping phase consists in the decision on the actual set of cores to run a task that has been assigned the corresponding number of cores in the allocation phase. The mapping phase tries to maximize the balance of total workload assigned to each core. We propose the LTLG (Longest Task, Lowest Group first) mapping heuristic as a generalization of LPT for the allocation phase. Finally, the frequency scaling phase assigns frequencies to tasks mapped to cores in the mapping phase, so that all tasks terminates before a target deadline while running at the lowest possible frequency, thereby saving energy. Crown schedulers can run all allocation, mapping and frequency scaling phases one after each other or all at the same time in an integrated manner (Fig. 3).

Overview of phase-separated vs. integrated crown scheduling. The methods shown by solid boxes are detailed in our TACO 2015 paper.

Integrated crown schedulers, ILP or heuristics, typically produce solution using lower energy than phase-separated crown scheduler; this is particularly visible in Fig. 2. However, they also typically run much more slowly. Our journal paper details the design integrated and phase-separated ILP-based and C++ heuristic implementations of crown schedules, and uses Mimer to provide quantitative comparison between crown schedulers and other non-crown schedulers.

Software License

The source code available above is released under GPLv3 licence. Please contact Nicolas Melot (nicolas.melot@liu.se) if you want to obtain another licence.

Ongoing work

This is a work in progress. Future work include the implementation of generalized crown schedulers, that is platforms are not restricted with a power of 2 number of cores, each levels of the crown may be divided in unequal groups sizes and each level may be divided in any number of smaller groups. This can open more optimization opportunities. Future work also include the minimization of cost of task-to-task communications as well as taking platform constraints into account, such as voltage and frequency islands.

If you would like to contribute, please let us know.

Contact

For reporting bugs, please email to "<firstname> DOT <lastname> AT liu DOT se".

Acknowledgments

This work was partly funded by the EU FP7 project EXCESS and by SeRC project OpCoReS.