DF21500
Multicore Computing (5p / 7.5hp)
Architectures, Models, Methods, Languages

HT1/2013

Important message (Nov 2012):
Due to a heavy workload and unexpected changes in course responsibilities, this Multicore Computing course, originally announced for spring 2013, will be postponed until early fall 2013 (probably end of August and early September for the lecture blocks, and mid-September for the labs).

Recommended for
Graduate (CUGS, CIS, ...) students interested in the areas of parallel computer architecture, parallel programming, software engineering, optimization, compiler construction, or algorithms and complexity.
The course is hosted by the CUGS graduate school as an Advanced Course.
Interested undergraduate students are also welcome, as long as there are free seats left.

Registration (internal access only).
(If you have no IDA research account, please contact the CUGS secretary, Anne Moe (annes \at ida.liu.se) for registration.)
List of participants 2011 (internal access only).

Organization
Lectures (ca. 32h), programming exercises, optional theoretical exercises for self-assessment, programming lab assignment, student presentations.
The course will be held as a block course with two intensive weeks in Linköping.

The course was last given
in 2011 and 2009. It partly overlaps (1p = 1.5hp) with the older course Advanced Parallel Programming, which was last given in 2007.
This course complements the existing graduate course FDA101 on Parallel Programming (4p), which is aliased to the undergraduate course TDDC78.

Goals
The course emphasizes fundamental aspects of shared-memory and accelerator-based parallel programming, such as shared memory parallel architecture concepts, programming models, performance models, parallel algorithmic paradigms, parallelization techniques and strategies, scheduling algorithms, optimization, composition of parallel programs, and concepts of modern parallel programming languages and systems. Practical exercises help to apply the theoretical concepts of the course to solve concrete problems in a real-world multicore system.

Prerequisites
Data structures and algorithms (e.g. TDDB57) are absolutely required; knowledge of complexity theory and compiler construction is useful. Some basic knowledge of computer architecture is assumed. Basic courses in concurrent programming (e.g. TDDB63/68/72) and parallel programming (e.g. FDA101/TDDC78 or TANA77) are recommended.
Programming in C and some familiarity with Linux (or a similar OS) are necessary for the practical exercises.

Contents/Schedule
Lecture block in 2 intensive weeks (2 x 4 days, with 4 to 6 lecture hours per day) in Linköping. Weeks 7 (14-17/2/2011) and 9 (28/2-3/3) are confirmed.
Rooms: Alan Turing and John von Neumann (see the schedule)

Lecture Notes

Lecture notes are made available here for registered participants.

Schedule (prel.)

Day | Time | Room | Lecturer | Topic
14 feb 2011 | 13:15-13:30 | A. Turing | C. Kessler | Organization and overview
14 feb | 13:30-17:00 | A. Turing | C. Kessler | Introduction to multicore architecture concepts and trends. Programming with POSIX threads.
15 feb | 09:15-12:00 | A. Turing | C. Kessler | Review of parallel computing theory: PRAM model. Parallel time, cost, speedup. Amdahl's Law, Gustafson's Law, Brent's Theorem. Scalability. NC. Basic data-parallel algorithmic building blocks: parallel prefix sums, list ranking. Delay model. BSP model. LogP model. Algorithmic design pattern: parallel divide-and-conquer.
15 feb | 13:30-15:00 / 15:30-17:00 | A. Turing / v. Neumann | C. Kessler | Cell BE architecture and programming. SIMD computing.
16 feb | 09:15-12:00 | v. Neumann | C. Kessler | Parallel programming language concepts.
16 feb | 13:30-15:00 | v. Neumann | E. Hansson | Lesson and lab kick-off: Cell programming. (Continue on your own.)
16 feb | 15:30-17:00 | v. Neumann | C. Kessler | Short introduction to Cilk and work-stealing scheduling. Introduction to transactional programming.
17 feb | 09:15-12:00 | A. Turing | I. Ragnemalm, J. Ogniewski | GPU architecture, shader programming, CUDA, OpenCL.
Day | Time | Room | Lecturer | Topic
28 feb 2011 | 13:15-17:00 | A. Turing | H. Sundell | Non-blocking synchronization. Introduction to the NOBLE library of non-blocking data structures and algorithms: lock-free memory, lock-free skip list, lock-free doubly linked list.
1 mar | 09:15-12:00 | A. Turing | J. Keller | Parallel sorting; graph algorithms for external memory (briefly); algorithmic design pattern: on-chip pipelining for Cell.
1 mar | 13:30-14:30 | v. Neumann | J. Keller | On-line scheduling and scheduling of series-parallel task graphs.
1 mar | 15:00-16:30 | v. Neumann | U. Dastgeer | Lesson and lab kick-off: GPU programming with CUDA and OpenCL. (Continue on your own.)
2 mar | 09:15-10:30 | v. Neumann | C. Kessler | Generic parallel programming with skeletons: concepts and generative realization for GPUs with SkePU. Mid-term evaluation.
2 mar | 10:45-12:00 | v. Neumann | W. Löwe | Bridging models and machines: BSP, LogP.
2 mar | 13:15-14:30 | A. Turing | W. Löwe | Simulations between models.
2 mar | 14:45-17:00 | A. Turing | W. Löwe | Static clustering and scheduling of task graphs. Scheduling malleable tasks.
3 mar | 09:15-10:00 | v. Neumann | E. Hansson | Cell lab, supervised time / demonstration.
3 mar | 10:15-10:45 | v. Neumann | C. Kessler | Guidelines for paper presentations. Distribution of papers for student presentations. Paper assignment.
3 mar | 10:45-12:00 | v. Neumann | C. Kessler | Autotuning overview. Optimized composition of performance-aware parallel components. Performance portability for heterogeneous multicore systems: the PEPPHER approach.
3 mar | 12:00-12:10 | v. Neumann | C. Kessler | Wrap-up and evaluation.
Day | Time | Room | Event
15 mar | 09:00-13:00 | v. Neumann | Written exam
23 mar | 09:00-17:00 | A. Turing | Student presentation session
10 jun | 13:00-17:00 | v. Neumann | Re-exam opportunity
Remark: Breaks are not shown in this schedule, and will be set as appropriate.


Labs

You can do either the Cell lab or the GPU lab (or both).
The lab moment of the course (1.5hp) is passed if the Cell lab OR the GPU lab is passed. An extra 1.5 hp is awarded if you pass both.

  1. Cell-Lab: Programming assignment with the Cell BE, a heterogeneous multicore processor.

    Prerequisites for the labs: C programming, Linux, and a basic course in concurrent programming and operating systems (e.g. TDDB68).

    Supervision and examination: Erik Hansson

  2. GPU-Lab: Programming assignment for GPUs with CUDA or OpenCL.

    Prerequisites for the labs: C programming, Linux, and a basic course in concurrent programming and operating systems (e.g. TDDB68).

    Supervision and examination: Usman Dastgeer

Literature
As main course book, we will use:

If you would like to do the Cell programming lab, the following may be useful:
A survey paper on multicore architecture and programming:
Background reading on the theory of parallel computation and on parallel algorithms:
Background reading on non-blocking data structures:

Further literature references will be added later.

Teachers

Examination

TEN1: Written exam (Welf, Christoph), 3p = 4.5hp. Due to partial overlap, 2p = 3hp for those who have already passed the exam in FDA125 Advanced Parallel Programming.

UPG1: Programming exercise (lab report) 1p = 1.5hp (or 2p = 3.0hp if both labs are solved correctly).

UPG2: Paper presentation, opposition, and written summary (1-2 pages, deadline 4/4/2011) of the presented paper, 1p = 1.5hp.

Credit
7.5hp if all examination moments are fulfilled, up to 9hp if both labs are passed.
Partial credits can be given, too, but the written exam must be passed.

Comments
There is a partial overlap in contents with the earlier course FDA125 Advanced Parallel Programming, corresponding to 1p = 1.5hp in the written exam.

In contrast to FDA125, this course focuses on shared-memory and multicore systems, including heterogeneous multicore and GPU architectures, and adds more material on software engineering techniques for parallel systems, such as design patterns for synchronization and optimized composition of parallel components. We also put more emphasis on parallelization techniques that are especially useful for multicore systems, such as SIMD programming, speculative parallelization, transactional programming, and lock-free synchronization. Scheduling for shared-memory systems is considered in more detail, in particular scheduling algorithms for malleable task graphs.


This page is maintained by Christoph Kessler (chrke \at ida.liu.se)