DF21500
Multicore Computing (5p / 7.5hp)
Architectures, Models, Methods, Languages

VT1/2009

Recommended for
Graduate (CUGS, ECSEL, ...) students interested in the areas of parallel computer architecture, parallel programming, software engineering, optimization, compiler construction, or algorithms and complexity.
The course is hosted by the CUGS graduate school as an Advanced Course.
Interested undergraduate students are also welcome, as long as there are free seats left.

Registration (internal access only).
(If you have no IDA research account, please contact the CUGS secretary, Anne Moe (annes \at ida.liu.se) for registration.)
List of participants (internal access only).

Organization
Lectures (ca. 32h), programming exercises, optional theoretical exercises for self-assessment, programming lab assignment, student presentations.
The course will be held as block course with two intensive weeks in Linköping.

The course was last given
This is a new course (2009). It partly overlaps (1p = 1.5hp) with Advanced Parallel Programming which was last given 2007.
This course complements the existing graduate course FDA101 on Parallel Programming (4p), which is aliased to the undergraduate course TDDC78.

Goals
The course emphasizes fundamental aspects of shared-memory parallel programming such as shared memory parallel architecture concepts, programming models, performance models, parallel algorithmic paradigms, parallelization techniques and strategies, scheduling algorithms, optimization, composition of parallel programs, and concepts of modern parallel programming languages and systems. Practical exercises help to apply the theoretical concepts of the course to solve concrete problems in a real-world multicore system.

Prerequisites
Data structures and algorithms (e.g. TDDB57) are absolutely required; knowledge in complexity theory and compiler construction is useful. Some basic knowledge of computer architecture is assumed. A basic course in concurrent programming (e.g. TDDB63/68/72) and parallel programming (e.g. FDA101/TDDC78 or TANA77) are recommended.
Programming in C and some familiarity with Linux (or similar OS) is necessary for the practical exercises.

Contents/Schedule
Lecture block in 2 intensive weeks (2 x 4 days at 4..6 hours) in Linköping. Room: Donald Knuth

>
DayTimeRoom LecturerTopic
9 feb
2009
13:15KnuthC. KesslerOrganization and overview
13:30
-17:00
J. Keller Parallel computer architecture concepts.
Programming with POSIX threads.
10 feb09:15
-12:00
KnuthJ. Keller Parallel sorting;
Graph algorithms for external memory
13:30
-17:00
C. Kessler Cell BE architecture and programming.
(Slides sent to registered participants by e-mail)
18:15
-20:00
Visionen UppLYSning: Introduktion till CUDA. (In swedish).
(optional evening seminar, not a part of the course, but fits well.)
11 feb09:15
-11:00
KnuthD. Liu SIMD architecture and programming
(Slides sent to registered participants by e-mail)
11:15
-12:00,
13:30
-16:15
C. Kessler Repetition of parallel computing theory:
PRAM model. Parallel time, cost, speedup. Amdahl's Law, Gustafsson's Law, Brent's Theorem. Scalability. NC. Basic data-parallel algorithmic building blocks: Parallel prefix sums, list ranking. Delay model. BSP model. LogP model.
Algorithmic design pattern: Parallel divide-and-conquer.
Short introduction to Cilk.
16:30
-17:00
M. Eriksson Lab kick-off: Getting started with Cell programming.
Continue on your own.
12 feb08:30
-10:00
KnuthC. Kessler Parallel programming language concepts.
(Slide set covers language concepts from OpenMP, MPI, Cilk, UPC, Java, etc.; also includes dynamic task scheduling, transactional programming.)
DayTimeRoomLecturerTopic
23 feb
2009
13:15
-17:00
KnuthW. Löwe Bridging models and machines: BSP, LogP.
Simulations between models.
24 feb09:15
-12:00
KnuthW. Löwe Static clustering and scheduling of task graphs. Scheduling malleable tasks.
Data distribution
13:30
-17:00
KnuthH. Sundell Non-blocking Synchronization.
Introduction to the NOBLE library of non-blocking data structures and algorithms.
Lock-free memory
Lock-free SkipList
Lock-free doubly linked list
25 feb09:15
-12:00
KnuthC. Kessler Generic parallel programming with skeletons: Concepts and generative realization for Cell BE.
" " C. Kessler Algorithmic design pattern: On-Chip pipelining for Cell.
13:30
-16:15
KnuthC. Kessler Optimized composition of parallel components.
Autotuning overview. (Have also a look at F. Bodin's HiPEAC'09 keynote slides).
" " C. Kessler Run-time parallelization and speculative parallelization.
Transactional programming. (see earlier slide set, and course book Ch. 18)
16:30
-17:00
KnuthC. Kessler Distribution of papers for student presentations
26 feb09:15
-11:45
KnuthI. Ragnemalm GPU architecture and programming with CUDA and OpenCL.
Slide set: 1.pdf (Intro), 2.pdf (Shaders), 4.pdf (GPGPU).
5.pdf (CUDA/G80) sent by e-mail to all registered participants; here is a shortened version..
11:45
-12:00
KnuthC. Kessler Wrap-up and evaluation.
DayTimeRoomLecturerTopic
6 march
2009
10:15-17:00 Knuth Student presentations
13 march
2009
13:30-17:00 Knuth Written exam
Remark: Breaks are not shown in this schedule, and will be set as appropriate.


Labs

Programming assignments with the CELL BE, a heterogeneous multicore processor.

Prerequisites for the labs: C programming, Linux, and a basic course in concurrent programming and operating systems (e.g. TDDB68).

Literature
As main course book, we will use:

If you would like to do the Cell programming labs, the following book may be useful: Alternatively, the IBM Cell SDK documentation and tutorial material available on the IBM developerWorks site for Cell can be used.

Further literature references will be added later.

Teachers

Examination
TEN1: Written exam (Welf, Christoph) 3p = 4.5hp. Due to partial overlap, 2p = 3hp for those who already have passed the exam in FDA125 Advanced Parallel Programming earlier.

UPG1: Programming exercise (lab report) 1p = 1.5hp.
UPG2: Paper presentation, opposition, and written summary (1-2 pages) of the presented paper, 1p = 1.5hp.

Credit
5p = 7.5hp if all examination moments are fulfilled.
Partial credits can be given, too.

Comments
There is a partial overlap in contents with FDA125 Advanced Parallel Programming, corresponding to 1p = 1.5hp in the written exam.

In contrast to FDA125, we focus in this course on shared memory and multicore systems, also heterogeneous multicore and GPU architectures, and add more material on software engineering techniques for parallel systems, such as design patterns for synchronization and optimized composition of parallel components. We also give more emphasis to parallelization techniques that are especially useful for multicore systems, such as SIMD programming, speculative parallelization, transactional programming and lock-free synchronization. Scheduling for shared memory systems will be considered in more detail, in particular scheduling algorithms for malleable task graphs.


This page is maintained by Christoph Kessler (chrke \at ida.liu.se)