Drake

Generic Framework for Portable Specification of Streaming Applications
and Deployment for On-Chip-Pipelined Execution on Manycore Systems


Contents:
Background and Motivation
Main Features
Publications
Software License
Download
Installation
Design a Drake Streaming Application
Compiling a Drake Application
Designing a Drake Execution Platform
Code Examples
Ongoing Work
Contact
Acknowledgments

You can access this page at http://www.ida.liu.se/labs/pelab/drake/index.en.shtml


Figure: The Drake workflow.

Drake is a plain C based programming framework and toolchain for the portable design and platform-specific optimized deployment of streaming applications on many-core processors.

Drake allows for a high-level, portable specification of streaming applications composed from streaming tasks in a modular way, separating between the following concerns:
- the streaming application graph (taskgraph) that connects the tasks,
- the module code that each task runs,
- the static schedule that maps tasks to cores, and
- the main program and platform backend that load and run the stream on a concrete platform.

While task code is based on plain C, Drake streaming application code may still be optimized for specific target execution platforms, for instance by leveraging SIMD instructions.

Drake makes experimenting with scheduling algorithms and their systematic quantitative evaluation considerably easier, as the scheduling researcher does not have to rewrite an application for different schedules or platforms, and the experiments and the visualization of the results can be automated.


Background and Motivation

Stream computing is a parallel computation model based on the pipelining pattern. It abstracts computations into separate streaming tasks that run possibly concurrently (all tasks run at the same time) and possibly in parallel (a single task may run on several cores); communication links model the forwarding of intermediate data packets between producing and consuming tasks along FIFO-buffered communication channels, thus forming a streaming task graph.

This approach limits the programming complexity to tasks of lower parallelism, or even sequential tasks, and it separates application-specific operations on one hand from message sending and reception on the other hand.

The performance of streaming applications is influenced by the scheduling strategy, both static and dynamic, which decides the set of cores that runs each task in the graph as well as each core's voltage and frequency, while the tasks keep executing the same application-specific program.

We aim at a high-level programming toolchain for streaming tasks that enables portability and, where possible, even performance portability across platforms, i.e., a single streaming application shall be able to run on different execution platforms, ported through platform-specific message passing routines as well as a schedule specific to the application and the platform.

Designing experiments with streaming applications on real execution platforms is difficult. The programmer needs to adapt implementations to the platform that runs them so as to get the best performance and collect meaningful data, and to repeat this optimization work for every execution platform in use and for every application. This process is error-prone and time-consuming, especially when parallelism enters the scene.

Drake is a toolchain to help with the design and automatic deployment of streaming applications to multicore and manycore target architectures.
Being part of the Mimer toolchain, Drake thus also enables the automated systematic experimental evaluation of scheduling algorithms, providing a service to the scheduling research community.


Main Features


Publications


Software License

The source code available above is released under the GPLv3 license. Please contact Nicolas Melot (nicolas.melot@liu.se) if you want to obtain another license.



Download

Dependencies

Platform backend

Drake for SCC

Drake for Intel-IA

Schedulers

Example

See also the HTML documentation generated by Doxygen.


Installation

Drake requires pelib to be built and installed. Please refer to its documentation on how to build and install it.

Define environment variables CC=gcc and CXX=g++ (or replace these values with your favorite C and C++ compilers, respectively, as long as they use the same command line switches).

Once all dependencies are installed, download Drake, extract the package, open a terminal, enter the directory where Drake was extracted, and then type

make install
to install it. It creates directories $prefix/bin, $prefix/lib, $prefix/lib/pkgconfig and $prefix/include and installs binaries, static and dynamic libraries, pkgconfig library information and include files in these directories, respectively. You can type
make install prefix=/some/other/directory
to install it elsewhere.
Binaries are built with the library search path including $prefix/lib, so you may need to give a custom prefix location when linking binaries as well.


Design a Drake Streaming Application

The design of a Drake application involves the design of the streaming application graph, the implementation of the module each task runs, the elaboration of a static schedule, and a main program to load and run the streaming application. The sections below show the structure of each of them.

Main C program

The following code snippet shows a minimal program that runs a Drake stream.

First, drake_arch_init() initializes the library that manages the execution platform. This function can take a pointer to some memory address that may be necessary to initialize the platform; see Section Examples for an example.

Then, we can create a stream, that is, instantiate an existing Drake streaming application. Here, we instantiate pingpong (see Section Examples for more information) with drake_stream_create(), a macro that takes a Drake stream handler data structure instance and a Drake application name as parameters and generates code that instantiates the stream and writes management data into the stream handler data structure (here, the structure instance "stream").

We can then initialize the application with drake_stream_init, using the stream handler returned by drake_stream_create() and a pointer to memory that is passed to tasks for their initialization (see Section Drake application module).

drake_stream_run() runs an initialized stream; once again it takes a stream handler structure. A call to drake_stream_run() is blocking and returns when all tasks mapped to processors have terminated.

Finally, drake_stream_destroy() and drake_arch_finalize() destroy all tasks and clean up memory.

#include <drake.h> /* header name assumed */

int main(int argc, char **argv)
{
        // Initialize platform backend
        drake_arch_init(NULL);

        // Create new stream for pingpong application and store handler in stream
        drake_stream_t stream;
        drake_stream_create(&stream, pingpong);

        // Initialize stream
        drake_stream_init(&stream, NULL);

        // Run stream. This is a blocking call
        drake_stream_run(&stream);

        // Destroy the stream
        drake_stream_destroy(&stream);

        // Finalize the platform backend
        drake_arch_finalize(NULL);

        return 0;
}

Depending on the situation, more code may be necessary. For instance, the Drake backend for SCC requires command-line arguments to be passed to it as an array of strings in order to function. Also, more code may be necessary to monitor performance.
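
As an illustration, the sketch below (reusing the pingpong application from the minimal program above) wraps the blocking drake_stream_run() call with standard POSIX timers to measure how long the stream runs. It deliberately uses the standard C library for timing; Drake's own abstract time type and drake_get_time() (see Section Monitoring) can be used instead, but their exact prototypes are not shown on this page.

#include <stdio.h>
#include <time.h>
/* plus the Drake header included by the minimal program above */

int main(int argc, char **argv)
{
        drake_arch_init(NULL);

        drake_stream_t stream;
        drake_stream_create(&stream, pingpong);
        drake_stream_init(&stream, NULL);

        // Measure the blocking run with a monotonic POSIX clock
        struct timespec begin, end;
        clock_gettime(CLOCK_MONOTONIC, &begin);
        drake_stream_run(&stream);
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("Stream ran for %f seconds\n",
                (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9);

        drake_stream_destroy(&stream);
        drake_arch_finalize(NULL);
        return 0;
}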

Drake Application Taskgraph

A streaming application graph (taskgraph) must be expressed using the GraphML syntax, possibly using yEd. The code below shows the example of a ping-pong application that consists of two tasks, ping and pong, that exchange messages with each other:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<!-- Created by igraph -->
  <key id="g_name" for="graph" attr.name="name" attr.type="string"/>
  <key id="g_label" for="graph" attr.name="label" attr.type="string"/>
  <key id="g_deadline" for="graph" attr.name="deadline" attr.type="string"/>
  <key id="v_name" for="node" attr.name="name" attr.type="string"/>
  <key id="v_module" for="node" attr.name="module" attr.type="string"/>
  <key id="v_workload" for="node" attr.name="workload" attr.type="double"/>
  <key id="v_max_width" for="node" attr.name="max_width" attr.type="string"/>
  <key id="v_efficiency" for="node" attr.name="efficiency" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <data key="g_name">pingpong</data>
    <data key="g_deadline">0</data>
    <node id="n0">
      <data key="v_name">ping</data>
      <data key="v_module">ping</data>
      <data key="v_workload">1</data>
      <data key="v_max_width">1</data>
      <data key="v_efficiency">exprtk: p == 1 ? 1 : 1e-6</data>
    </node>
    <node id="n1">
      <data key="v_name">pong</data>
      <data key="v_module">pong</data>
      <data key="v_workload">1</data>
      <data key="v_max_width">1</data>
      <data key="v_efficiency">exprtk: p == 1 ? 1 : 1e-6</data>
    </node>
    <edge id="e0" source="n0" target="n1"/>
    <edge id="e1" source="n1" target="n0"/>
  </graph>
</graphml>

Figure 1 below is a visual representation generated by yEd.

Fig 1: Graphic representation of the ping-pong application.

Our streaming application graph description is based on GraphML and adds properties for the graph and the nodes:

Graph properties:
- name (key g_name): the name of the streaming application,
- label (key g_label): an optional label for the graph,
- deadline (key g_deadline): the deadline of the application (0 in the example above).

Node properties:
- name (key v_name): the name of the task,
- module (key v_module): the module whose code the task runs,
- workload (key v_workload): the amount of work the task performs,
- max_width (key v_max_width): the maximal number of cores the task can run on,
- efficiency (key v_efficiency): the parallel efficiency of the task as a function of the number of cores p allotted to it, here given as an exprtk expression.

Drake Application Module

All the code a task runs is defined in an application module. A module implements five functions that define the complete lifecycle of a streaming task: initialization, start, work, stop and destruction.
The code snippet below shows the skeleton of a module:

#include <drake/node.h>

int
drake_init(task_t *task, void* aux)
{
	// Report that initialization went well
	return 1;
}

int
drake_start(task_t *task)
{
	// Report that the task is ready to start working
	return 1;
}

int
drake_run(task_t *task)
{
	// Terminate the task when there is no more data to read and all predecessor tasks have terminated
	return drake_task_depleted(task);
}

int
drake_kill(task_t *task)
{
	// Report that this function ran correctly
	return 1;
}

int
drake_destroy(task_t *task)
{
	// Report that this function ran correctly
	return 1;
}

When a Drake stream is initialized, each core runs the function drake_init() for each task that is mapped to it.
Similarly, when the stream begins to run, cores run drake_start() for each task, repeatedly, until it returns a non-zero value. This can be used, for instance, to make sure that the main work function does not start before the task has received any input.
After drake_start() returns a non-zero value, a task runs drake_run(), the main work function of the task. If drake_run() returns 0, the task runs drake_run() again in the next streaming pipeline stage; otherwise the task is considered terminated. One can use drake_task_depleted() to compute a return value suitable for drake_run(): it returns 0 if at least one of the task's input streams holds at least one data element or if at least one of its predecessor tasks is still running.
When a task has terminated, it runs drake_kill() before the pipeline stops.
Finally, each core runs the function drake_destroy() for each task mapped to it when the pipeline is destroyed.

Except for drake_start() and drake_run(), all task functions run exactly once, and their return value indicates whether the function ran without error (non-zero) or whether something went wrong (0).

Any global variable or custom additional function must be declared static to be portable. A non-static global variable or function may compile correctly if its module is used by at most one task, but may result in linking errors otherwise. Also, global variables may allow tasks to communicate with each other on some backend platforms, which is invalid in streaming applications and cannot work on platforms without any globally shared memory space, such as the Intel SCC.
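
As an illustration of this rule, the sketch below keeps a per-module iteration counter and a helper function, both declared static so that linking several tasks and modules into one application cannot cause symbol clashes. Only drake_run() is shown, and the counter and helper names are hypothetical.

#include <drake/node.h>

// Module-private state and helper: both static, so that these symbols stay
// local to each compiled task (names are hypothetical examples)
static unsigned int iterations = 0;

static void record_progress(void)
{
	// For instance, keep statistics about the number of pipeline stages run
	iterations++;
}

int
drake_run(task_t *task)
{
	record_progress();

	// Terminate when no input data remains and all predecessors have terminated
	return drake_task_depleted(task);
}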

Drake Application Schedule

The schedule of a Drake application is written in XML.
The code snippet below shows a schedule for the ping-pong application of Section Drake application taskgraph.

<?xml version="1.0" encoding="UTF-8"?>
<schedule name="Generated from Algebra" appname="Generated from Algebra">
 <core coreid="1">
  <task name="ping" start="0.000000" frequency="800000.000000" width="1.000000" workload="1.000000"/>
 </core>
 <core coreid="2">
  <task name="pong" start="0.000000" frequency="800000.000000" width="1.000000" workload="1.000000"/>
 </core>
</schedule>
You can choose to design a schedule manually for a Drake application, or use a Mimer scheduler to compute one. See the Mimer documentation for more information about the structure of an XML schedule.
You can use pelib-convert to generate C code from an XML schedule and compile it into the final application, for instance with the command below:
pelib-convert --input --format taskgraph-graphml --file taskgraph.graphml --input --format platform-ampl_input --file platform.dat --input --format schedule-xml --file schedule.xml --output --format schedule-drake --stdout



Compiling a Drake Application

A complete compiled Drake application consists of the main program, the compiled code for each task, and the code that provides information about the streaming application such as the number of tasks or the static schedule. If several tasks share the same module, its source code is compiled once for every task.

This is how to compile a task:

gcc -c -D_DRAKE_COMPILE -DAPPLICATION=pingpong -DTASK_MODULE=ping -DTASK_NAME=ping -o ping.o ping.c

where TASK_NAME=ping denotes the name of the task as in the corresponding taskgraph. "pingpong" is the name of the streaming application; it may differ from the graph name defined in the corresponding taskgraph.
TASK_MODULE=ping is the module of the task and ping.c is the source code of the module. The operation must be repeated for every task.

This is how to compile the scheduling information code generated by pelib-convert (see Drake application schedule):

gcc -c -DAPPLICATION=pingpong -o drakecc_pingpong.schedule.o pingpong.schedule.c
where "pingpong" is the application being compiled.

Finally, the main program can be compiled with the command:

gcc main.c schedule.o ping.o pong.o `pkg-config --cflags --libs drake drake-scc`

Alternatively, you can reuse the set of Makefiles available in existing Drake applications and adapt src/Makefile.in to your streaming application.



Designing a Drake Execution Platform

The role of a Drake platform backend consists of managing platform-specific operations, including message transmission, frequency and voltage scaling, as well as time and power monitoring. This section gives high-level instructions on how these operations are performed in Drake in order to facilitate the implementation of a new platform backend.

Message passing

Drake assumes an individual on-chip communication memory associated with each core. A core has fast read and write access to its associated communication memory, and a core can read or write another core's communication memory through direct or indirect memory operations. For this reason, we consider this on-chip memory as shared and we call it shared memory.

A Drake platform backend provides a shared memory allocation function drake_local_malloc(), a shared memory free function drake_local_free(), as well as drake_arch_local_size(), which returns the size in bytes of the shared memory associated with a core.

Furthermore, a Drake platform backend implements drake_remote_addr(), which, given an address in the caller core's shared memory and a core id, returns an address that the caller can directly read from or write to and that reflects the remote core's own shared memory at the same location.

Data at memory addresses returned by drake_remote_addr() is not actually transferred until drake_arch_pull() or drake_arch_commit() is called, before a read or after a write operation, respectively.

The fundamental strategy of Drake to implement communication consists of exposing to the streaming task programmer memory addresses that can be read from or written to, together with data transmission triggers (drake_arch_pull() and drake_arch_commit()) that actually transport data between remote tasks.
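
The sketch below illustrates this strategy for a one-integer transfer from the caller core to another core. The prototypes declared at the top are assumptions made only to keep the example self-contained; refer to the Doxygen documentation for the authoritative signatures.

#include <stddef.h>

/* Assumed prototypes, for illustration only */
void *drake_local_malloc(size_t size);
void drake_local_free(void *addr);
void *drake_remote_addr(void *local_addr, size_t core_id);
void drake_arch_commit(void *addr);

void send_one_int(size_t consumer_core, int value)
{
	// Allocate one slot in the caller core's shared communication memory
	int *slot = drake_local_malloc(sizeof(int));

	// Translate it into an address that reflects the consumer core's
	// shared memory at the same location
	int *remote = drake_remote_addr(slot, consumer_core);

	*remote = value;            // write the payload
	drake_arch_commit(remote);  // trigger the actual data transfer

	drake_local_free(slot);
}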

Tasks communicate data through fifo data structures. Pelib fifos include a memory buffer that holds the data elements that tasks read or write. When both tasks are mapped to the same core, this memory buffer is allocated using the standard malloc() allocation function. However, if the two tasks are mapped to different cores, then Drake allocates the buffer as a portion of a per-core stack that is allocated, when the stream is created, using drake_local_malloc() with as much memory as returned by drake_arch_local_size(). Communication between two tasks mapped to different cores happens as shown in Fig. 2.

Figure 2: Communication process in Drake.
We assume that t1 takes its input from main memory and that t2 forwards it to another task, local or distant. Since t1 and t2 are mapped to exactly two processors and each processor runs exactly one task, we call the processor of t1 p1 and the processor of t2 p2. p1 pushes messages toward processor p2. The fifo's buffer and its read and write pointers are taken from the stack buffer allocated using drake_local_malloc() and held in the link data structure.

Figure 2 shows 8 steps that we detail here:

Figure 2 shows all local and distant messages exchanged during a communication round. All polling is performed locally, and distant messages are sent exactly once from the producer to the consumer for data and notification, and exactly once from the consumer to the producer as requests to proceed. Note that a Drake application should use peekaddr() and discard() as well as writeaddr() and fill() instead of pop(), peek() and push(): the former allow direct read and write operations on several elements with little additional work, whereas the latter involve much more overhead and manipulate only one element at a time.
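
The sketch below illustrates this bulk-access pattern. The fifo handle type and the prototypes are hypothetical stand-ins for the pelib fifo API; consult the Doxygen documentation for the exact signatures.

#include <stddef.h>

/* Hypothetical stand-ins for the pelib fifo bulk-access functions */
typedef struct fifo fifo_t;
int *peekaddr(fifo_t *f, size_t *available);  /* address of readable data and its count  */
void discard(fifo_t *f, size_t count);        /* mark count elements as consumed         */
int *writeaddr(fifo_t *f, size_t *space);     /* address of writable space and its count */
void fill(fifo_t *f, size_t count);           /* mark count elements as produced         */

static void forward_bulk(fifo_t *input, fifo_t *output)
{
	size_t readable, writable;
	int *src = peekaddr(input, &readable);
	int *dst = writeaddr(output, &writable);
	size_t count = readable < writable ? readable : writable;

	// Process many elements with direct memory accesses, instead of one
	// pop()/push() call per element
	for(size_t i = 0; i < count; i++)
	{
		dst[i] = src[i];
	}

	discard(input, count);  // release the consumed input elements
	fill(output, count);    // publish the produced output elements
}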

Voltage and Frequency scaling

Drake reads voltage and frequency settings, or sets them, with the platform backend functions drake_get_frequency() and drake_get_voltage(), or drake_arch_set_frequency() and drake_arch_set_voltage(), respectively. Please refer to the Doxygen documentation for more information about them. Additionally, drake_arch_set_frequency_autoscale() sets the caller core's frequency as well as the minimal voltage level required for that frequency.
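
As an illustration, the sketch below scales the caller core down when it has little work. The prototypes and the frequency value are assumptions; the unit is taken to match the frequencies used in the schedule example above.

/* Assumed prototypes, for illustration only; see the Doxygen documentation */
int drake_get_frequency();                              /* current core frequency            */
int drake_arch_set_frequency_autoscale(int frequency);  /* frequency plus its minimal voltage */

static void throttle_if_idle(int has_work)
{
	// 400000 is an assumed value, in the same unit as the schedule above
	if(!has_work && drake_get_frequency() > 400000)
	{
		// Also selects the minimal voltage level required for this frequency
		drake_arch_set_frequency_autoscale(400000);
	}
}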

Monitoring

Drake defines abstract time and power monitoring data types. They are defined as pointers to undefined structures so that any Drake platform backend implementation can model them in the most efficient manner the platform can offer.
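
The sketch below shows the opaque-pointer pattern this describes, with hypothetical names: the public header only declares a pointer to an undefined structure, and each platform backend defines the structure privately in whatever way suits it best.

/* In a public, platform-independent header (names hypothetical): */
struct drake_time;                        /* deliberately left undefined here  */
typedef struct drake_time *drake_time_t;  /* opaque handle passed around Drake */

/* In the private source of one backend, e.g. a POSIX-like platform: */
#include <time.h>
struct drake_time
{
	struct timespec timestamp;  /* this backend chooses a plain timespec */
};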

The time data structure is intended to store the time upon any call to drake_get_time(). See the Doxygen documentation for more information about the allocation, free, display and arithmetic operations on time that Drake requires.

The power data structure manages power monitoring and stores the collected data so that the platform backend implementation can display it on screen as formatted output. See the power-related functions in the Doxygen documentation for more information on power allocation, free and display operations, as well as on how Drake uses the power monitoring functions.



Examples

There are a few Drake applications already available.

You can download and extract any or all of the application examples and compile them with
make install
Note that you need pelib, Drake, as well as a Mimer scheduler installed. By default, these applications use a simple crown heuristic scheduler based on the LTLG and Height heuristics. You can install this scheduler by downloading and installing the crown scheduling library.


Ongoing work

This is a work in progress. Future work includes the simplification of the function interface and the definition of a more consistent naming convention. In particular, remote shared memory allocation should be possible without a local stack and without translating local memory addresses to remote addresses, but directly through a call to a memory allocation function that takes the target core. Also, shared memory allocation could be split between data and mailboxes and include the possibility to allocate less memory than requested (returning the actual amount available), so that the memory allocator can return aligned memory addresses that allow SIMD operations. We also plan to improve our implementation of communication channels between parallel tasks, and to implement a Drake backend for Xeon processors.

If you would like to contribute, please let us know.



Contact

For reporting bugs, please email to "<firstname> DOT <lastname> AT liu DOT se".

Acknowledgments

This work was partly funded by the EU FP7 project EXCESS and by SeRC project OpCoReS.