NestStep

a bulk-synchronous parallel (BSP) global address space language
supporting nested parallelism

by Christoph W. Kessler

Introduction

NestStep is a partitioned global address space (PGAS) language for the bulk-synchronous parallel (BSP) programming model.
NestStep is designed as an explicitly parallel extension of existing imperative programming languages such as C or Java. Several instances of the same NestStep program run on different processors and communicate via an interconnection network and the NestStep run time system, which emulates a CRCW-BSP computer with a virtually shared memory on top of a message-passing system.

The key features of NestStep are:

The language design of NestStep was developed 1998-2000 by Christoph W. Kessler, then at the University of Trier, Germany.
The NestStep language extensions have been defined for Java (NestStep-Java, 1998), for C (NestStep-C, 2000) and for the imperative part of Modelica (NestStepModelica, 2006).
Implementations of a NestStep run-time system have been done in Java (1999) and C (2000, 2006). See the implementation and download information further below for technical details of the implementations.
The current version of the NestStep-C run-time system is available for download.

NestStep is well-suited for parallel applications that match the BSP model, such as most parallel linear algebra computations, iterative solvers for partial differential equations etc., see for instance Bisseling's book for a presentation of BSP algorithms in scientific computing.
In contrast, NestStep is not appropriate for all problems that inherently require sequential consistency or bilateral mutual exclusion synchronization, because this conflicts with the superstep consistency of NestStep. In certain cases, such applications could be transformed into bulk-synchronous form without introducing prohibitive overhead. An elaboration on this topic can be found in the PhD thesis by Arturo Gonzalez-Escribano (2003).

BSP model of parallel computation

The BSP (bulk-synchronous parallel) model, as introduced 1990 by Valiant and implemented e.g. by the Oxford BSPlib library and the Paderborn University BSP library for many parallel architectures, structures a parallel computation of p processors into a sequence of supersteps that are separated by global barrier synchronization points.

A superstep consists of
(1) a phase of local computation of each processor, where only local variables (and locally held copies of remote variables) can be accessed, and
(2) a communication phase that sends some data to the processors that may need them in the next superstep(s), and then waits for incoming messages and processes them.
The separating barrier after a superstep is not necessary if it is implicitly covered by (2). For instance, if the communication pattern in (2) is a complete exchange, a processor can proceed to the next superstep as soon as it has received a message from every other processor.

In order to simplify cost predictions, the BSP machine model is characterized by only four parameters: the number p of processors, the per-processor byte transfer rate g the latency L (i.e. the minimum time between subsequent synchronizations), and the processor speed, s.

Additionally, some properties of the program dramatically influence the run time, in particular the communication behaviour. By convention, the maximum total number of bytes in all outgoing or incoming messages for a single processor at the end of a specific superstep is denoted by h. A communication pattern that bounds the maximum communication indegree and outdegree of any processor by h is called h-relation. Hence, a superstep performing balanced local work w and requiring the system to realize an h relation in the communication phase has time complexity w+gh+L.

The BSP model, as originally defined, has no support for shared memory. Also, in BSP there is no support for processor subset synchronization, i.e. for nesting of supersteps. Thus, programs can only exploit one-dimensional parallelism or must apply a flattening-transformation that converts nested parallelism to flat parallelism. However, automatic flattening by the compiler has only been achieved for SIMD parallelism (e.g. NESL).

More links on BSP:

Nested parallelism by nesting of supersteps

In NestStep a superstep is marked by a step keyword preceding a statement. Such step statements denote supersteps that are executed by entire groups of processors in a bulk-synchronous way.

NestStep introduces static and dynamic nesting of supersteps, and thus directly supports nested parallelism. There are several variants of nesting of supersteps, marked by parameter(s) of the neststep statement.

Differences from thread-parallel programming

NestStep processes differ from threads, as occurring e.g. in OpenMP, pthreads or Java threads: For instance they do not share any memory by default, and synchronized or similar concepts for Java threads work only on the local JVM (Java Virtual Machine). Rather, NestStep-Java processes run on different JVM instances on maybe different machines that are coupled by the NestStep language extensions and run-time system to a virtual parallel computer. This is why we prefer to call them processors in the sequel. The use of pthreads, Java threads or RMI in addition to the NestStep extensions is not explicitly prohibited but its effect is implementation dependent and solely under the programmers responsibility.

SPMD execution in NestStep

The main() function in a NestStep-C program, or resp. the main method of a class whose name has been passed as parameter to the NestStep-Java driver program, is executed by all available processors specified in the host file. Their number will remain constant throughout program execution (SPMD-model of parallel program execution). These started processors form the so-called root group. Groups can be dynamically subdivided during program execution into subgroups, following the static nesting structure of the supersteps. All processors within the same group execute the same step statement.

Hierarchical processor group concept in NestStep

For each group there is on each processor belonging to it a class Group object. In particular, it contains the group size, the group ID, and the processor's rank within the group. This Group object for the current group is referenced in NestStep programs by thisgroup. thisgroup is initialized automatically when the group is created, updated as the group processes step statements, and restored to the parent group when the group terminates.

Shared variables, relaxed memory consistency, combining, parallel prefix

Shared variables are declared with the sh type extender, such as:

sh int count;

By default, a scalar shared variable is replicated across all processors of the group declaring it, i.e. each one has its personal copy. The step statement is the basic control mechanism for shared memory consistency: On all processors of the same group entering a step, the shared variable copies have, at entry, the same value.

At the end of the step, together with an implicit barrier, there is a group-wide combine phase. The modified copies of shared variables are combined and broadcast within the current group according to a programmable strategy, such that all processors again share the same values of the shared variables. The combine strategy can be individually specified for each shared variable at its declaration, by an extender of the sh keyword, and can also be adapted for any particular superstep. The set of predefined combine strategies include common concurrent write, arbitrary concurrent write, global reductions (sum, max, min, ...) and parallel prefix computations. The programmer may define own combine functions, too.

For space economy, NestStep also provides distributed shared arrays in a way similar to HPF and Split-C. There are powerful array distribution and redistribution operators available. The distribution, as well as the sharity of the array elements, is part of the array's type.

Example programs 1 (NestStep-Java)

Example programs 2 (NestStep-C)

NestStep implementations

The run-time system of NestStep-Java was implemented in Java in 1998 but soon abandoned because of bad performance, in particular, Java's (then) extremely slow object serialization and deserialization needed for the communication between different JVMs.
In 2000, the run-time system of a compiler for a NestStep-C was implemented (in C) on top of MPI.
In 2005/2006, the NestStep-C run-time system was reimplemented on top of the Tlib task library by Th. Rauber and G. Rünger, which provides group descriptors and hierarchical group splitting used in the implementation of our nested supersteps. On top of Tlib, the NestStep Runtime System adds supersteps and the emulation of virtually shared variables and arrays with programmable BSP consistency.
Together with the layers below it, the NestStep Runtime System provides thus a virtual shared-memory BSP machine; direct access from the NestStep-C level to Tlib or MPI functionality is not intended (albeit not actively prohibited).


NestStep-C Implementations

The NestStep-C implementation consists of the following components:

Cluster-NestStep-C:

  • MPI-Cluster NestStep-C run-time system, Version 2 (2006) by Joar Sohl, based on Tlib
    Based on an older version by C. Kessler (2000)

Cell-NestStep-C:

In 2007, another version of the NestStep-C run-time system was implemented for the CELL processor:

NestStep-Modelica

The NestStepModelica implementation combines an extension of the Modelica language and compiler to support the NestStep constructs, with the NestStep-C run-time system.
The language extensions, implemented in MetaModelica, are limited to the imperative part of the Modelica language, namely imperative computations of computationally heavy but side-effect free user functions occuring on the right-hand side of the ordinary differential equation system generated from a Modelica program. These computations are encapsulated in Modelica's algorithm construct.
The extended compiler generates C code that is linked with the NestStep-C run-time system.

Download

The prototype of the NestStep-C runtime system, including the necessary files for the Tlib library, can be downloaded here:

A first version of the NestStep-C to C front end compiler, written by Magnus Holm during his final thesis (2010) project, is now available, along with an extended run-time system for Cell BE (nested supersteps are not supported yet).
Available from C. Kessler on request.
System requirements (tested on a Windows machine): Java 2 SDK SE 1.6.x (or later), ANTLRv2, GCC.

Publications:

Related projects:

NestStep is a so-called PGAS (partitioned global address space) parallel programming language. Other such languages include UPC Universal Parallel C (and its predecessors Split-C, PCP and AC), X11, Titanium, and Co-Array Fortran.

Some links:


This page by Christoph W. Kessler (chrke \at ida.liu.se)