|
Important message (Nov 2012): Due to heavy work load and unexpected changes of course responsibilities, this Multicore Computing course, originally announced for spring 2013, will be postponed until (early) fall 2013 (probably, end of august and early september for the lecture blocks, mid september for the labs). |
Recommended for
Graduate (CUGS, CIS, ...) students interested in the areas of
parallel computer architecture, parallel programming, software engineering,
optimization, compiler construction, or algorithms and complexity.
The course is hosted by the
CUGS graduate school
as an Advanced Course.
Interested undergraduate students are also welcome, as long as there are
free seats left.
Registration (internal access only).
(If you have no IDA research account,
please contact the CUGS secretary, Anne Moe (annes \at ida.liu.se)
for registration.)
List of participants 2011 (internal access only).
Organization
Lectures (ca. 32h),
programming exercises, optional
theoretical exercises for self-assessment, programming lab assignment,
student presentations.
The course will be held as block course with two intensive weeks
in Linköping.
The course was last given
in 2011 and 2009. It partly overlaps (1p = 1.5hp) with the older course
Advanced Parallel Programming which was last given 2007.
This course complements the existing
graduate course FDA101 on
Parallel Programming (4p), which is aliased to the undergraduate course
TDDC78.
Goals
The course emphasizes fundamental aspects of shared-memory
and accelerator-based parallel programming, such as
shared memory parallel architecture concepts, programming models,
performance models,
parallel algorithmic paradigms, parallelization techniques and
strategies, scheduling algorithms, optimization, composition of
parallel programs, and
concepts of modern parallel programming languages and systems.
Practical exercises help to apply the theoretical concepts
of the course to solve concrete problems in a real-world
multicore system.
Prerequisites
Data structures and algorithms (e.g. TDDB57) are absolutely required;
knowledge in complexity theory and compiler construction is useful.
Some basic knowledge of computer architecture is assumed.
A basic course in concurrent programming
(e.g. TDDB63/68/72) and parallel programming
(e.g. FDA101/TDDC78 or TANA77) are recommended.
Programming in C and some familiarity with Linux (or similar OS)
is necessary for the practical exercises.
Contents/Schedule
Lecture block in 2 intensive weeks (2 x 4 days at 4..6 hours)
in Linköping.
Weeks 7 (14-17/2/2011) and 9 (28/2-3/3) are confirmed.
Rooms:
Alan Turing
and
John von Neumann
(see the schedule)
| Day | Time | Room | Lecturer | Topic |
| 14 feb 2011 | 13:15-13:30 | A. Turing | C. Kessler | Organization and overview |
| 13:30-17:00 | A. Turing | C. Kessler |
Introduction to Multicore architecture concepts and trends. Programming with POSIX threads. |
|
| 15 feb | 09:15-12:00 | A. Turing | C. Kessler | Review of parallel computing theory: PRAM model. Parallel time, cost, speedup. Amdahl's Law, Gustafsson's Law, Brent's Theorem. Scalability. NC. Basic data-parallel algorithmic building blocks: Parallel prefix sums, list ranking. Delay model. BSP model. LogP model. Algorithmic design pattern: Parallel divide-and-conquer. |
| 13:30-15:00 15:30-17:00 | A. Turing v.Neumann | C. Kessler |
Cell BE architecture and programming. SIMD computing. |
|
| 16 feb | 09:15-12:00 | v.Neumann | C. Kessler | Parallel programming language concepts. |
| 13:30-15:00 | v.Neumann | E. Hansson | Lesson and lab kick-off: Cell programming. (Continue on your own.) |
|
| 15:30-17:00 | v.Neumann | C. Kessler |
Short introduction to Cilk and work-stealing scheduling. Introduction to transactional programming. | |
| 17 feb | 09:15-12:00 | A. Turing | I. Ragnemalm J. Ogniewski |
GPU Architecture, Shader Programming, CUDA. OpenCL. |
| Day | Time | Room | Lecturer | Topic |
| 28 feb 2011 | 13:15-17:00 | A. Turing | H. Sundell |
Non-blocking Synchronization. Introduction to the NOBLE library of non-blocking data structures and algorithms. Lock-free memory Lock-free SkipList Lock-free doubly linked list |
| 1 mar | 09:15-12:00 | A. Turing | J. Keller |
Parallel sorting; Graph algorithms for external memory (briefly); Algorithmic design pattern: On-Chip pipelining for Cell. |
| 1 mar | 13:30-14:30 | v.Neumann | J. Keller | On-line scheduling and scheduling of series-parallel task graphs. |
| 1 mar | 15:00-16:30 | v.Neumann | U. Dastgeer |
Lesson and Lab kick-off: GPU programming with CUDA and OpenCL (continue on your own) |
| 2 mar | 09:15-10:30 | v.Neumann | C. Kessler |
Generic parallel programming with skeletons:
Concepts and generative realization for GPUs with SkePU. Mid-term evaluation. |
| 2 mar | 10:45-12:00 | v.Neumann | W. Löwe | Bridging models and machines: BSP, LogP. |
| 2 mar | 13:15-14:30 | A. Turing | W. Löwe | Simulations between models. |
| 2 mar | 14:45-17:00 | A. Turing | W. Löwe |
Static clustering and scheduling of task graphs. Scheduling malleable tasks. |
| 3 mar | 09:15-10:00 | v.Neumann | E. Hansson | Cell-lab, supervised time / demonstration |
| 3 mar | 10:15-10:45 | v.Neumann | C. Kessler |
Guidelines for Paper Presentations. Distribution of papers for student presentations. Paper assignment |
| 3 mar | 10:45-12:00 | v.Neumann | C. Kessler |
Autotuning overview. Optimized composition of performance-aware parallel components. Performance portability for heterogeneous multicore systems: The PEPPHER Approach. |
| 3 mar | 12:00-12:10 | v.Neumann | C. Kessler | Wrap-up and evaluation. |
| Day | Time | Room | Lecturer | Topic |
| 15 mar | 09:00-13:00 | v.Neumann | Written exam | |
| 23 mar | 09:00-17:00 | A. Turing | Student presentation session | |
| 10 jun | 13:00-17:00 | v.Neumann | Re-exam opportunity |
You can do either the Cell lab or the GPU lab (or both).
The lab moment of the course (1.5hp) is passed
if the Cell lab OR the GPU lab is passed.
An extra 1.5 hp is awarded if you pass both.
Cell-Lab: Programming assignment with the CELL BE, a heterogeneous multicore processor.
Prerequisites for the labs: C programming, Linux, and a basic course in concurrent programming and operating systems (e.g. TDDB68).
Supervision and examination: Erik Hansson
Prerequisites for the labs: C programming, Linux, and a basic course in concurrent programming and operating systems (e.g. TDDB68).
Supervision and examination: Usman Dastgeer
Literature
As main course book, we will use:
Further literature references will be added later.
Teachers
Examination
TEN1: Written exam (Welf, Christoph) 3p = 4.5hp. Due to partial overlap, 2p = 3hp for those who already have passed the exam in FDA125 Advanced Parallel Programming earlier.
UPG2: Paper
presentation, opposition, and written summary
(1-2 pages, deadline 4/4/2011) of the presented paper, 1p = 1.5hp.
Credit
7.5hp if all examination moments are fulfilled, up to 9hp if both labs are passed.
Partial credits can be given, too, but the written exam must be passed.
Comments
There is a partial overlap in contents with the earlier course
FDA125 Advanced Parallel Programming, corresponding to 1p = 1.5hp
in the written exam.
In contrast to FDA125, we focus in this course on shared memory and multicore systems, also heterogeneous multicore and GPU architectures, and add more material on software engineering techniques for parallel systems, such as design patterns for synchronization and optimized composition of parallel components. We also give more emphasis to parallelization techniques that are especially useful for multicore systems, such as SIMD programming, speculative parallelization, transactional programming and lock-free synchronization. Scheduling for shared memory systems will be considered in more detail, in particular scheduling algorithms for malleable task graphs.