Drake

Generic Framework for Portable Specification of Streaming Applications
and Deployment for On-Chip-Pipelined Execution on Manycore Systems


Contents:
Background and Motivation
Main Features
Publications
Software License
Download
Installation
Design a Drake Streaming Application
Compiling a Drake Application
Designing a Drake Execution Platform
Code Examples
Ongoing Work
Contact
Acknowledgments

You can access this page at http://www.ida.liu.se/labs/pelab/drake/index.en.shtml


Figure: The Drake workflow.

Drake is a plain C based programming framework and toolchain for the portable design and platform-specific optimized deployment of streaming applications on many-core processors.

Drake allows for a high-level, portable specification of streaming applications composed from streaming tasks in a modular way, separating between the following concerns:
- the streaming application graph (taskgraph) that connects the tasks,
- the module code that each task runs,
- the static schedule that maps tasks to cores, and
- the main program and platform backend that load and run the stream on a concrete platform.

While task code is based on plain C, Drake streaming application code may still be optimized for specific target execution platforms, for instance by leveraging SIMD instructions.

Drake makes experimenting with scheduling algorithms and their systematic quantitative evaluation considerably easier, as the scheduling researcher does not have to rewrite an application for different schedules or platforms, and the experiments and the visualization of the results can be automated.


Background and Motivation

Stream computing is a parallel computation model based on the pipelining pattern. It abstracts computations into separate streaming tasks that run possibly concurrently (all tasks run at the same time) and possibly in parallel (a single task may run on several cores); communication links model the forwarding of intermediate data packets between producing and consuming tasks along FIFO-buffered communication channels, thus forming a streaming task graph.

This approach limits the programming complexity to tasks of lower parallelism, or even sequential tasks, and it separates application-specific operations on one hand from message sending and reception on the other hand.

The performance of streaming applications is influenced by the scheduling strategy, both static and dynamic, which decides the set of cores that runs each task in the graph as well as each core's voltage and frequency, while the tasks keep executing the same application-specific program.

We aim at a high-level programming toolchain for streaming tasks that enables portability and, where possible, even performance portability across platforms, i.e., a single streaming application shall be able to run on different execution platforms, ported through platform-specific message passing routines as well as a schedule specific to the application and the platform.

Designing experiments with streaming applications on real execution platforms is difficult. The programmer needs to adapt implementations to the platform that runs them so as to get the best performance and collect meaningful data, and to repeat this optimization work for every execution platform in use and for every application. This process is error-prone and time-consuming, especially when parallelism enters the scene.

Drake is a toolchain to help with the design and automatic deployment of streaming applications to multicore and manycore target architectures.
Being part of the Mimer toolchain, Drake thus also enables the automated systematic experimental evaluation of scheduling algorithms, providing a service to the scheduling research community.


Main Features


Publications


Software License

The source code available above is released under the GPLv3 license. Please contact Nicolas Melot (nicolas.melot@liu.se) if you want to obtain another license.



Download

Dependencies

Platform backend

Drake for SCC

Drake for Intel-IA

Schedulers

Example

See also the HTML documentation generated by Doxygen.


Installation

Drake requires pelib to be built and installed. Please refer to its documentation on how to build and install it.

Define environment variables CC=gcc and CXX=g++ (or replace these values with your favorite C and C++ compilers, respectively, as long as they use the same command line switches).

Once all dependencies are installed, download Drake, extract the package, open a terminal, enter the directory where Drake was extracted, and then type

make install
to install it. It creates directories $prefix/bin, $prefix/lib, $prefix/lib/pkgconfig and $prefix/include and installs binaries, static and dynamic libraries, pkgconfig library information and include files in these directories, respectively. You can type
make install prefix=/some/other/directory
to install it elsewhere.
Binaries are built with the library search path including $prefix/lib, so you may need to give a custom prefix location when linking binaries as well.


Design a Drake Streaming Application

The design of a Drake application involves the design of the streaming application graph, the implementation of the module each task runs, the elaboration of a static schedule, and a main program to load and run the streaming application. The sections below show the structure of each of them.

Main C program

The following code snippet shows a minimal program that runs a Drake stream.

First, drake_arch_init() initializes the library that manages the execution platform. This function can take a pointer to some memory address that may be necessary to initialize the platform; see Section Examples for an example.

Then, we can create a stream, that is, instantiate an existing Drake streaming application. Here, we instantiate pingpong (see Section Examples for more information) with drake_stream_create(), a macro that takes a Drake stream handler data structure instance and a Drake application name as parameters and generates code that instantiates the stream and writes management data into the stream handler data structure (here, the structure instance "stream").

We can then initialize the application with drake_stream_init, using the stream handler returned by drake_stream_create() and a pointer to memory that is passed to tasks for their initialization (see Section Drake application module).

drake_stream_run() runs an initialized stream; once again it takes a stream handler structure. A call to drake_stream_run() is blocking and returns when all tasks mapped to processors have terminated.

Finally, drake_stream_destroy() and drake_arch_finalize() destroy all tasks and clean up memory.

#include <drake.h> /* header name assumed */

int main(int argc, char **argv)
{
        // Initialize platform backend
        drake_arch_init(NULL);

        // Create new stream for pingpong application and store handler in stream
        drake_stream_t stream;
        drake_stream_create(&stream, pingpong);

        // Initialize stream
        drake_stream_init(&stream, NULL);

        // Run stream. This is a blocking call
        drake_stream_run(&stream);

        // Destroy the stream
        drake_stream_destroy(&stream);

        // Finalize the platform backend
        drake_arch_finalize(NULL);

        return 0;
}

Depending on the situation, more code may be necessary. For instance, the Drake backend for SCC requires command-line arguments to be passed to it as an array of strings in order to function. Also, more code may be necessary to monitor performance.
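
As an illustration, the sketch below (reusing the pingpong application from the minimal program above) wraps the blocking drake_stream_run() call with standard POSIX timers to measure how long the stream runs. It deliberately uses the standard C library for timing; Drake's own abstract time type and drake_get_time() (see Section Monitoring) can be used instead, but their exact prototypes are not shown on this page.

#include <stdio.h>
#include <time.h>
/* plus the Drake header included by the minimal program above */

int main(int argc, char **argv)
{
        drake_arch_init(NULL);

        drake_stream_t stream;
        drake_stream_create(&stream, pingpong);
        drake_stream_init(&stream, NULL);

        // Measure the blocking run with a monotonic POSIX clock
        struct timespec begin, end;
        clock_gettime(CLOCK_MONOTONIC, &begin);
        drake_stream_run(&stream);
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("Stream ran for %f seconds\n",
                (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9);

        drake_stream_destroy(&stream);
        drake_arch_finalize(NULL);
        return 0;
}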

Drake Application Taskgraph

A streaming application graph (taskgraph) must be expressed using the GraphML syntax, possibly using yEd. The code below shows the example of a ping-pong application that consists of two tasks, ping and pong, that exchange messages with each other:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<!-- Created by igraph -->
  <key id="g_name" for="graph" attr.name="name" attr.type="string"/>
  <key id="g_label" for="graph" attr.name="label" attr.type="string"/>
  <key id="g_deadline" for="graph" attr.name="deadline" attr.type="string"/>
  <key id="v_name" for="node" attr.name="name" attr.type="string"/>
  <key id="v_module" for="node" attr.name="module" attr.type="string"/>
  <key id="v_workload" for="node" attr.name="workload" attr.type="double"/>
  <key id="v_max_width" for="node" attr.name="max_width" attr.type="string"/>
  <key id="v_efficiency" for="node" attr.name="efficiency" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <data key="g_name">pingpong</data>
    <data key="g_deadline">0</data>
    <node id="n0">
      <data key="v_name">ping</data>
      <data key="v_module">ping</data>
      <data key="v_workload">1</data>
      <data key="v_max_width">1</data>
      <data key="v_efficiency">exprtk: p == 1 ? 1 : 1e-6</data>
    </node>
    <node id="n1">
      <data key="v_name">pong</data>
      <data key="v_module">pong</data>
      <data key="v_workload">1</data>
      <data key="v_max_width">1</data>
      <data key="v_efficiency">exprtk: p == 1 ? 1 : 1e-6</data>
    </node>
    <edge id="e0" source="n0" target="n1"/>
    <edge id="e1" source="n1" target="n0"/>
  </graph>
</graphml>

Figure 1 below is a visual representation generated by yEd.

Fig 1: Graphic representation of the ping-pong application.

Our streaming application graph description is based on GraphML and adds properties for the graph and the nodes:

Graph properties:
- name (key g_name): the name of the streaming application,
- label (key g_label): an optional label for the graph,
- deadline (key g_deadline): the deadline of the application (0 in the example above).

Node properties:
- name (key v_name): the name of the task,
- module (key v_module): the module whose code the task runs,
- workload (key v_workload): the amount of work the task performs,
- max_width (key v_max_width): the maximal number of cores the task can run on,
- efficiency (key v_efficiency): the parallel efficiency of the task as a function of the number of cores p allotted to it, here given as an exprtk expression.

Drake Application Module

All the code a task runs is defined in an application module. A module implements five functions that define the complete lifecycle of a streaming task: initialization, start, work, stop and destruction.
The code snippet below shows the skeleton of a module:

#include <drake/node.h>

int
drake_init(task_t *task, void* aux)
{
	// Report that initialization went well
	return 1;
}

int
drake_start(task_t *task)
{
	// Report that the task is ready to start working
	return 1;
}

int
drake_run(task_t *task)
{
	// Terminate the task when there is no more data to read and all predecessor tasks have terminated
	return drake_task_depleted(task);
}

int
drake_kill(task_t *task)
{
	// Report that this function ran correctly
	return 1;
}

int
drake_destroy(task_t *task)
{
	// Report that this function ran correctly
	return 1;
}

When a Drake stream is initialized, each core runs the function drake_init() for each task that is mapped to it.
Similarly, when the stream begins to run, cores run drake_start() for each task, repeatedly, until it returns a non-zero value. This can be used, for instance, to make sure that the main work function does not start before the task has received any input.
After drake_start() returns a non-zero value, a task runs drake_run(), the main work function of the task. If drake_run() returns 0, the task runs drake_run() again in the next streaming pipeline stage; otherwise the task is considered terminated. One can use drake_task_depleted() to compute a return value suitable for drake_run(): it returns 0 if at least one of the task's input streams holds at least one data element or if at least one of its predecessor tasks is still running.
When a task has terminated, it runs drake_kill() before the pipeline stops.
Finally, each core runs the function drake_destroy() for each task mapped to it when the pipeline is destroyed.

Except for drake_start() and drake_run(), all task functions run exactly once, and their return value indicates whether the function ran without error (non-zero) or whether something went wrong (0).

Any global variable or custom additional function must be declared static to be portable. A non-static global variable or function may compile correctly if its module is used by at most one task, but may result in linking errors otherwise. Also, global variables may allow tasks to communicate with each other on some backend platforms, which is invalid in streaming applications and cannot work on platforms without any globally shared memory space, such as the Intel SCC.
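
As an illustration of this rule, the sketch below keeps a per-module iteration counter and a helper function, both declared static so that linking several tasks and modules into one application cannot cause symbol clashes. Only drake_run() is shown, and the counter and helper names are hypothetical.

#include <drake/node.h>

// Module-private state and helper: both static, so that these symbols stay
// local to each compiled task (names are hypothetical examples)
static unsigned int iterations = 0;

static void record_progress(void)
{
	// For instance, keep statistics about the number of pipeline stages run
	iterations++;
}

int
drake_run(task_t *task)
{
	record_progress();

	// Terminate when no input data remains and all predecessors have terminated
	return drake_task_depleted(task);
}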

Drake Application Schedule

The schedule of a Drake application is written in XML.
The code snippet below shows a schedule for the ping-pong application of Section Drake application taskgraph.

<?xml version="1.0" encoding="UTF-8"?>
<schedule name="Generated from Algebra" appname="Generated from Algebra">
 <core coreid="1">
  <task name="ping" start="0.000000" frequency="800000.000000" width="1.000000" workload="1.000000"/>
 </core>
 <core coreid="2">
  <task name="pong" start="0.000000" frequency="800000.000000" width="1.000000" workload="1.000000"/>
 </core>
</schedule>
You can choose to design a schedule manually for a Drake application, or use a Mimer scheduler to compute one. See the Mimer documentation for more information about the structure of an XML schedule.
You can use pelib-convert to generate C code from an XML schedule and compile it into the final application, for instance with the command below:
pelib-convert --input --format taskgraph-graphml --file taskgraph.graphml --input --format platform-ampl_input --file platform.dat --input --format schedule-xml --file schedule.xml --output --format schedule-drake --stdout



Compiling a Drake Application

A complete compiled Drake application consists of the main program, the compiled code for each task, and the code that provides information about the streaming application such as the number of tasks or the static schedule. If several tasks share the same module, its source code is compiled once for every task.

This is how to compile a task:

gcc -c -D_DRAKE_COMPILE -DAPPLICATION=pingpong -DTASK_MODULE=ping -DTASK_NAME=ping -o ping.o ping.c

where TASK_NAME=ping denotes the name of the task as in the corresponding taskgraph. "pingpong" is the name of the streaming application; it may differ from the graph name defined in the corresponding taskgraph.
TASK_MODULE=ping is the module of the task and ping.c is the source code of the module. The operation must be repeated for every task.

This is how to compile the scheduling information code generated by pelib-convert (see Drake application schedule):

gcc -c -DAPPLICATION=pingpong -o drakecc_pingpong.schedule.o pingpong.schedule.c
where "pingpong" is the application being compiled.

Finally, the main program can be compiled with the command:

gcc main.c schedule.o ping.o pong.o `pkg-config --cflags --libs drake drake-scc`

Alternatively, you can reuse the set of Makefiles available in existing Drake applications and adapt src/Makefile.in to your streaming application.



Designing a Drake Execution Platform

The role of a Drake platform backend consists of managing platform-specific operations, including message transmission, frequency and voltage scaling, as well as time and power monitoring. This section gives high-level instructions on how these operations are performed in Drake in order to facilitate the implementation of a new platform backend.

Message passing

Drake assumes an individual on-chip communication memory associated with each core. A core has fast read and write access to its associated communication memory, and a core can read or write another core's communication memory through direct or indirect memory operations. For this reason, we consider this on-chip memory as shared and we call it shared memory.

A Drake platform backend provides a shared memory allocation function drake_local_malloc(), a shared memory free function drake_local_free(), as well as drake_arch_local_size(), which returns the size in bytes of the shared memory associated with a core.

Furthermore, a Drake platform backend implements drake_remote_addr(), which, given an address in the caller core's shared memory and a core id, returns an address that the caller can directly read from or write to and that reflects the remote core's own shared memory at the same location.

Data at memory addresses returned by drake_remote_addr() is not actually transferred until drake_arch_pull() or drake_arch_commit() is called, before a read or after a write operation, respectively.

The fundamental strategy of Drake to implement communication consists of exposing to the streaming task programmer memory addresses that can be read from or written to, together with data transmission triggers (drake_arch_pull() and drake_arch_commit()) that actually transport data between remote tasks.
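
The sketch below illustrates this strategy for a one-integer transfer from the caller core to another core. The prototypes declared at the top are assumptions made only to keep the example self-contained; refer to the Doxygen documentation for the authoritative signatures.

#include <stddef.h>

/* Assumed prototypes, for illustration only */
void *drake_local_malloc(size_t size);
void drake_local_free(void *addr);
void *drake_remote_addr(void *local_addr, size_t core_id);
void drake_arch_commit(void *addr);

void send_one_int(size_t consumer_core, int value)
{
	// Allocate one slot in the caller core's shared communication memory
	int *slot = drake_local_malloc(sizeof(int));

	// Translate it into an address that reflects the consumer core's
	// shared memory at the same location
	int *remote = drake_remote_addr(slot, consumer_core);

	*remote = value;            // write the payload
	drake_arch_commit(remote);  // trigger the actual data transfer

	drake_local_free(slot);
}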

Tasks communicate data through fifo data structures. Pelib fifos include a memory buffer that holds the data elements that tasks read or write. When both tasks are mapped to the same core, this memory buffer is allocated using the standard malloc() allocation function. However, if the two tasks are mapped to different cores, then Drake allocates the buffer as a portion of a per-core stack that is allocated, when the stream is created, using drake_local_malloc() with as much memory as returned by drake_arch_local_size(). Communication between two tasks mapped to different cores happens as shown in Fig. 2.

Figure 2: Communication process in Drake.
We assume that t1 takes its input from main memory and that t2 forwards it to another task, local or distant. Since t1 and t2 are mapped to exactly two processors and each processor runs exactly one task, we call the processor of t1 p1 and the processor of t2 p2. p1 pushes messages toward processor p2. The fifo's buffer and its read and write pointers are taken from the stack buffer allocated using drake_local_malloc() and held in the link data structure.

Figure 2 shows 8 steps that we detail here:

Figure 2 shows all local and distant messages exchanged during a communication round. All polling is performed locally, and distant messages are sent exactly once from the producer to the consumer for data and notification, and exactly once from the consumer to the producer as requests to proceed. Note that a Drake application should use peekaddr() and discard() as well as writeaddr() and fill() instead of pop(), peek() and push(): the former allow direct read and write operations on several elements with little additional work, whereas the latter involve much more overhead and manipulate only one element at a time.
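
The sketch below illustrates this bulk-access pattern. The fifo handle type and the prototypes are hypothetical stand-ins for the pelib fifo API; consult the Doxygen documentation for the exact signatures.

#include <stddef.h>

/* Hypothetical stand-ins for the pelib fifo bulk-access functions */
typedef struct fifo fifo_t;
int *peekaddr(fifo_t *f, size_t *available);  /* address of readable data and its count  */
void discard(fifo_t *f, size_t count);        /* mark count elements as consumed         */
int *writeaddr(fifo_t *f, size_t *space);     /* address of writable space and its count */
void fill(fifo_t *f, size_t count);           /* mark count elements as produced         */

static void forward_bulk(fifo_t *input, fifo_t *output)
{
	size_t readable, writable;
	int *src = peekaddr(input, &readable);
	int *dst = writeaddr(output, &writable);
	size_t count = readable < writable ? readable : writable;

	// Process many elements with direct memory accesses, instead of one
	// pop()/push() call per element
	for(size_t i = 0; i < count; i++)
	{
		dst[i] = src[i];
	}

	discard(input, count);  // release the consumed input elements
	fill(output, count);    // publish the produced output elements
}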

Voltage and Frequency scaling

Drake reads voltage and frequency settings, or sets them, with the platform backend functions drake_get_frequency() and drake_get_voltage(), or drake_arch_set_frequency() and drake_arch_set_voltage(), respectively. Please refer to the Doxygen documentation for more information about them. Additionally, drake_arch_set_frequency_autoscale() sets the caller core's frequency as well as the minimal voltage level required for that frequency.
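
As an illustration, the sketch below scales the caller core down when it has little work. The prototypes and the frequency value are assumptions; the unit is taken to match the frequencies used in the schedule example above.

/* Assumed prototypes, for illustration only; see the Doxygen documentation */
int drake_get_frequency();                              /* current core frequency            */
int drake_arch_set_frequency_autoscale(int frequency);  /* frequency plus its minimal voltage */

static void throttle_if_idle(int has_work)
{
	// 400000 is an assumed value, in the same unit as the schedule above
	if(!has_work && drake_get_frequency() > 400000)
	{
		// Also selects the minimal voltage level required for this frequency
		drake_arch_set_frequency_autoscale(400000);
	}
}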

Monitoring

Drake defines abstract time and power monitoring data types. They are defined as pointers to undefined structures so that any Drake platform backend implementation can model them in the most efficient manner the platform can offer.
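
The sketch below shows the opaque-pointer pattern this describes, with hypothetical names: the public header only declares a pointer to an undefined structure, and each platform backend defines the structure privately in whatever way suits it best.

/* In a public, platform-independent header (names hypothetical): */
struct drake_time;                        /* deliberately left undefined here  */
typedef struct drake_time *drake_time_t;  /* opaque handle passed around Drake */

/* In the private source of one backend, e.g. a POSIX-like platform: */
#include <time.h>
struct drake_time
{
	struct timespec timestamp;  /* this backend chooses a plain timespec */
};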

The time data structure is intended to store the time upon any call to drake_get_time(). See the Doxygen documentation for more information about the allocation, free, display and arithmetic operations on time that Drake requires.

The power data structure manages power monitoring and stores the collected data so that the platform backend implementation can display it on screen as formatted output. See the power-related functions in the Doxygen documentation for more information on power allocation, free and display operations, as well as on how Drake uses the power monitoring functions.



Examples

There are a few Drake applications already available.

You can download and extract any or all of the application examples and compile them with
make install
Note that you need pelib, Drake, as well as a Mimer scheduler installed. By default, these applications use a simple crown heuristic scheduler based on the LTLG and Height heuristics. You can install this scheduler by downloading and installing the crown scheduling library.


Ongoing work

This is a work in progress. Future work includes the simplification of the function interface and the definition of a more consistent naming convention. In particular, remote shared memory allocation should be possible without a local stack and without translating local memory addresses to remote addresses, but directly through a call to a memory allocation function that takes the target core. Also, shared memory allocation could be split between data and mailboxes and include the possibility to allocate less memory than requested (returning the actual amount available), so that the memory allocator can return aligned memory addresses that allow SIMD operations. We also plan to improve our implementation of communication channels between parallel tasks, and to implement a Drake backend for Xeon processors.

If you would like to contribute, please let us know.



Contact

For reporting bugs, please email to "<firstname> DOT <lastname> AT liu DOT se".

Acknowledgments

This work was partly funded by the EU FP7 project EXCESS and by SeRC project OpCoReS.