Recommended for
Graduate (CUGS, ECSEL, ...) students interested in the areas of
parallel computer architecture, parallel programming, software engineering,
optimization, compiler construction, or algorithms and complexity.
The course is hosted by the
CUGS graduate school
as an Advanced Course.
Interested undergraduate students are also welcome, as long as there are
free seats left.
Registration (internal access only).
(If you have no IDA research account,
please contact the CUGS secretary, Anne Moe (annes \at ida.liu.se)
for registration.)
List of participants (internal access only).
Organization
Lectures (ca. 32h),
programming exercises, optional
theoretical exercises for self-assessment, programming lab assignment,
student presentations.
The course will be held as block course with two intensive weeks
in Linköping.
The course was last given
This is a new course (2009). It partly overlaps (1p = 1.5hp) with
Advanced Parallel Programming which was last given 2007.
This course complements the existing
graduate course FDA101 on
Parallel Programming (4p), which is aliased to the undergraduate course
TDDC78.
Goals
The course emphasizes fundamental aspects of shared-memory parallel
programming such as
shared memory parallel architecture concepts, programming models,
performance models,
parallel algorithmic paradigms, parallelization techniques and
strategies, scheduling algorithms, optimization, composition of
parallel programs, and
concepts of modern parallel programming languages and systems.
Practical exercises help to apply the theoretical concepts
of the course to solve concrete problems in a real-world
multicore system.
Prerequisites
Data structures and algorithms (e.g. TDDB57) are absolutely required;
knowledge in complexity theory and compiler construction is useful.
Some basic knowledge of computer architecture is assumed.
A basic course in concurrent programming
(e.g. TDDB63/68/72) and parallel programming
(e.g. FDA101/TDDC78 or TANA77) are recommended.
Programming in C and some familiarity with Linux (or similar OS)
is necessary for the practical exercises.
Contents/Schedule
Lecture block in 2 intensive weeks (2 x 4 days at 4..6 hours)
in Linköping. Room:
Donald Knuth
| Day | Time | Room | Lecturer | Topic |
| 9 feb 2009 | 13:15 | Knuth | C. Kessler | Organization and overview |
| 13:30 -17:00 | J. Keller |
Parallel computer architecture concepts. Programming with POSIX threads. |
||
| 10 feb | 09:15 -12:00 | Knuth | J. Keller | Parallel sorting; Graph algorithms for external memory |
| 13:30 -17:00 | C. Kessler | Cell BE architecture and programming. (Slides sent to registered participants by e-mail) | >
||
| 18:15 -20:00 | Visionen | UppLYSning: Introduktion till CUDA. (In swedish). (optional evening seminar, not a part of the course, but fits well.) |
||
| 11 feb | 09:15 -11:00 | Knuth | D. Liu | SIMD architecture and programming (Slides sent to registered participants by e-mail) |
| 11:15 -12:00, 13:30 -16:15 | C. Kessler | Repetition of parallel computing theory: PRAM model. Parallel time, cost, speedup. Amdahl's Law, Gustafsson's Law, Brent's Theorem. Scalability. NC. Basic data-parallel algorithmic building blocks: Parallel prefix sums, list ranking. Delay model. BSP model. LogP model. Algorithmic design pattern: Parallel divide-and-conquer. Short introduction to Cilk. |
||
| 16:30 -17:00 | M. Eriksson | Lab kick-off: Getting started with Cell programming. Continue on your own. |
||
| 12 feb | 08:30 -10:00 | Knuth | C. Kessler |
Parallel programming language concepts. (Slide set covers language concepts from OpenMP, MPI, Cilk, UPC, Java, etc.; also includes dynamic task scheduling, transactional programming.) |
| Day | Time | Room | Lecturer | Topic |
| 23 feb 2009 | 13:15 -17:00 | Knuth | W. Löwe |
Bridging models and machines: BSP, LogP. Simulations between models. |
| 24 feb | 09:15 -12:00 | Knuth | W. Löwe |
Static clustering and scheduling of task graphs. Scheduling malleable tasks.
Data distribution |
| 13:30 -17:00 | Knuth | H. Sundell |
Non-blocking Synchronization. Introduction to the NOBLE library of non-blocking data structures and algorithms. Lock-free memory Lock-free SkipList Lock-free doubly linked list |
|
| 25 feb | 09:15 -12:00 | Knuth | C. Kessler | Generic parallel programming with skeletons: Concepts and generative realization for Cell BE. |
| " | " | C. Kessler | Algorithmic design pattern: On-Chip pipelining for Cell. | |
| 13:30 -16:15 | Knuth | C. Kessler |
Optimized composition of parallel components. Autotuning overview. (Have also a look at F. Bodin's HiPEAC'09 keynote slides). |
|
| " | " | C. Kessler |
Run-time parallelization and speculative parallelization. Transactional programming. (see earlier slide set, and course book Ch. 18) |
|
| 16:30 -17:00 | Knuth | C. Kessler | Distribution of papers for student presentations | |
| 26 feb | 09:15 -11:45 | Knuth | I. Ragnemalm | GPU architecture and programming with CUDA and OpenCL. Slide set: 1.pdf (Intro), 2.pdf (Shaders), 4.pdf (GPGPU). 5.pdf (CUDA/G80) sent by e-mail to all registered participants; here is a shortened version.. |
| 11:45 -12:00 | Knuth | C. Kessler | Wrap-up and evaluation. | |
| Day | Time | Room | Lecturer | Topic |
| 6 march 2009 | 10:15-17:00 | Knuth | Student presentations | |
| 13 march 2009 | 13:30-17:00 | Knuth | Written exam |
Labs
Programming assignments with the CELL BE, a heterogeneous multicore processor.
Literature
As main course book, we will use:
Further literature references will be added later.
Teachers
Examination
TEN1:
Written exam
(Welf, Christoph)
3p = 4.5hp. Due to partial overlap, 2p = 3hp
for those who already have passed the exam in FDA125 Advanced
Parallel Programming earlier.
Credit
5p = 7.5hp if all examination moments are fulfilled.
Partial credits can be given, too.
Comments
There is a partial overlap in contents with
FDA125 Advanced Parallel Programming, corresponding to 1p = 1.5hp
in the written exam.
In contrast to FDA125, we focus in this course on shared memory and multicore systems, also heterogeneous multicore and GPU architectures, and add more material on software engineering techniques for parallel systems, such as design patterns for synchronization and optimized composition of parallel components. We also give more emphasis to parallelization techniques that are especially useful for multicore systems, such as SIMD programming, speculative parallelization, transactional programming and lock-free synchronization. Scheduling for shared memory systems will be considered in more detail, in particular scheduling algorithms for malleable task graphs.