TDDD56 Multicore and GPU Programming
Reading directions per lecture topic can be found here.
For the 2018 edition of the course, we will have two new course compendia (2018) that cover some (but not all) of the lectures of the course. These are made available to registered course participants via the course homepage.
- C. Kessler: Design and Analysis of Parallel Algorithms:
Compendium, (c) 2018.
Available for registered participants of TDDD56 (2018). Not for general distribution.
Covers the three TDDD56 lectures on analysis of parallel algorithms and on parallel sorting.
- I. Ragnemalm: Attack in Packs. Compendium, (c) 2018.
Available for registered participants of TDDD56 (2018). Not for general distribution.
Covers the TDDD56 GPU lectures.
So far, there is no single textbook (yet) that
covers the entire course contents.
In principle you should be able to follow the course based on the OH material, the two course compendia, and the online references (e.g., articles) given further below.
If you prefer to work with a textbook, we recommend that you choose one general introductory text to parallel programming (not needed if you already have taken TDDC78 earlier), such as Wilkinson/Allen or Lin/Snyder, and possibly complement this with a book or online material about GPU programming. If you want to work with a GPU programming book, too, we recommend, in the first hand, Sanders/Kandrot CUDA by example, see below. We also give a selection of reference books for secondary reading below.
All books listed below should be available in the Campus Valla library (D-house), some also as reference copy or e-books.
C. Lin, L. Snyder: Principles of Parallel Programming.
Pearson/Addison Wesley, 2008. 978-0-321-54942.
(General introduction text. The Campus-Valla library has a copy for loan and will also have a reference copy or e-book.)
T. Mattson, B. Sanders, B. Massingill:
Patterns for Parallel Programming.
Addison Wesley, 2005. 978-0-321-22811-6
(More a survey, could be useful as a secondary book. The Campus-Valla library has a copy for loan.)
M. Herlihy and N. Shavit:
The Art of Multiprocessor Programming.
Morgan Kaufmann Publishers, 2008. 978-0-12-370591-4
(More advanced issues in shared-memory synchronization, esp. about non-blocking concurrent data structures. The Campus-Valla library has one copy for loan and also as e-book.)
B. Wilkinson, M. Allen:
Parallel Programming, Second Edition.
Pearson/Prentice Hall, 2005. 0-13-140563-2
(General introduction text. The course book in TDDC78. The Campus-Valla library has one reference copy and one copy for loan.)
J. Sanders, E. Kandrot:
CUDA by example.
Addison-Wesley, 2011. 978-0-13-138768-3
(Recommended further reading, for GPU programming in CUDA. The Campus-Valla library has one copy for loan.)
David B. Kirk and Wen-mei W. Hwu:
Programming Massively Parallel Processors: A Hands-on Approach.
Morgan Kaufmann, 2010 (first edition), 2012 (second edition). ISBN 0123814723
(Only secondary reading, for GPU programming in CUDA. Available in the Campus-Valla library as e-book.)
A. Grama, G. Karypis, V. Kumar, A. Gupta:
Introduction to Parallel Computing, 2nd Edition.
(General introduction text with focus on parallel algorithms. The Campus-Valla library has a copy for loan.)
A. Munshi, B. Gaster, T. Mattsson, J. Fung, D. Ginsburg:
OpenCL Programming Guide.
Addison-Wesley, 2011. Print-edition 978-0-321-74964-2
(Background reading on OpenCL. The Campus-Valla library has acquired one copy for loan.)
B. Gaster, L. Howes, D. Kaeli, P. Mistry
Heterogeneous Computing with OpenCL.
Morgan Kaufmann Publishers, 2011, 978-0123877666
(Background reading on OpenCL. The Campus-Valla library will acquire one copy for loan.)
On-line material and referencesOn shared-memory parallel architecture concepts:
- J. Hennessy, D. Patterson: Computer Architecture, A Quantitative Approach, 4th ed.. Morgan Kaufmann, 2007. The second edition (1996), third edition (2002) and fourth edition (2007) each contain a chapter on multiprocessor architecture. The Campus-Valla library has copies. As an alternative:
- D. Culler, J. Singh, A. Gupta: Parallel Computer Architecture. Morgan Kaufmann, 1998. Contains an in-depth treatment of SMP and CC-NUMA shared-memory architectures.
- Adam Morrison: Scaling Synchronization in Multicore Programs. Communications of the ACM 59(11), Nov. 2016.
- Maged M. Michael: The Balancing Act of Choosing Nonblocking Features. Communications of the ACM 56(9), Sep. 2013.
- W. Hillis, G. Steele: Data parallel algorithms. Comm. ACM 29(12), Dec. 1986
- C. Kessler, J. Keller: Models for parallel computing: Review and Perspectives. PARS-Mitteilungen 24: 13-29, ISSN 0177-0454, Dec. 2007. Gesellschaft für Informatik e.V., Germany.
- J. Keller, C. Kessler, J. Träff:
Practical PRAM Programming. Wiley Interscience, 2001.
Chapter 2 gives a summary of the PRAM model and introduces basic theory of parallel processing (Time, work, cost, Amdahl's law, Brent's theorem, fundamental parallel algorithms). The book is available in the Campus-Valla library.
- M. Hill, M. Marty: Amdahl's Law in the Multicore Era. IEEE Computer 41(7): 33-37, July 2008.
- M. Garland, D. Kirk: Understanding Throughput-Oriented Architectures. Communications of the ACM 53(11), Nov. 2010.
- J. A. Stratton et al.: Algorithm and data optimization techniques for scaling to massively threaded systems. IEEE Computer 44(8), Aug. 2012, pp. 26-32.
- J. Lebar (Google): Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler (Presentation at Cppcon 2016, Youtube). Contains an introduction to GPGPU architecture and programming from a CPU perspective, which can be of interest for TDDD56 as extra material. The presentation also explains how CUDA compilation works.
- M. Zahran: Heterogeneous Computing is Here to Stay. Communications of the ACM 60(3):42-45, 2017.
- N. Satish et al.: Can traditional programming bridge the Ninja performance gap for parallel computing applications? Comm. of the ACM 58(5), May 2015, pp. 77-86.
Reading Directions per Lecture/Topic
The following list gives directions to background reading
especially where the information given on the slides should not suffice.
Bibliographical details of the referenced textbooks can be found in the literature list above.
Background knowledge (Prerequisites)
Not all participants of this course have a core computer science background,
hence some technical terms might not be well known to everyone even if
the topic is listed in the course prerequisites in the syllabus.
Nevertheless, here are some links to catch up before / during the first days
of the course.
- Slide set on the analysis of sequential algorithms and Big-O notation from an earlier course in data structures and algorithms.
- Central terms about concurrent programming (processes, threads, race conditions, synchronization, mutual exclusion, locks, deadlocks, ...) are recapitulated and explained in the slide set of the third lecture, even though there is no time to go through all this in detail again in TDDD56. If you need more background reading about these concepts, read e.g. Chapters 3, 4 and 6 of (any recent edition of) Galvin, Silberschatz, Gagne: Operating System Concepts, or any other good textbook on operating systems concepts.
- If there are any further items that are new to you but not to most others, please let us know.
On multicore architecture concepts and trends (Lecture 1)
- Chapter 2 of Lin/Snyder (until p. 53)
provides some introductory material about multicore architecture concepts.
(There is nothing about multicore and accelerator hardware concepts in
the Wilkinson/Allen book.)
For a more in-depth review of general concepts in (parallel) computer architecture, such as VLIW, hardware multithreading, multicore, caches, (shared) memory etc., see e.g. Hennessy, Patterson: Computer Architecture, a quantitative approach, 4th edition 2006, 5th edition 2011, or later, or other modern textbooks about computer architecture.
On shared memory architecture, cache coherence, false sharing etc. (Lecture 2)
- The MSI and MESI coherence protocols (bus snooping) are described e.g. in Section 5.3 of the book by Culler, Pal Singh and Gupta, Parallel Computer Architecture, 1998.
- Consistency models are summarized e.g. in Wilkinson/Allen Parallel Programming 2e, Section 9.3.
On programming with threads and tasks (Lecture 3)
- A survey of shared memory programming (processes, threads, critical sections, locks, thread pools, OpenMP) is given e.g. in Wilkinson/Allen Parallel Programming 2e, Chapter 8, or in Chapter 6 of Lin/Snyder Principles of Parallel Programming.
- Dynamic load balancing is described e.g. in Wilkinson/Allen Parallel Programming 2e, Section 7.2.
- Programming with light-weight tasks (a la Cilk and Futures) and dynamic scheduling by work-stealing is shortly presented e.g. in Herlihy/Shavit The Art of Multiprocessor Programming, Chapter 16.
On non-blocking synchronization (Lecture 4)
- Chapters 9, 10 and 11 of Herlihy/Shavit The Art of Multiprocessor Programming give more information about lock-free data structures (linked lists, stacks, queues), non-blocking synchronization, and the ABA problem.
On design and analysis of parallel algorithms (Lecture 5-6)
- Most of the theory, including a proof of Brent's Theorem, is presented e.g. in Chapter 2 of Keller, Kessler, Träff: Practical PRAM Programming, Wiley 2001.
- Speedup etc. is introduced e.g. in Wilkinson/Allen Parallel Programming 2e, Section 1.1, or in Lin/Snyder Principles of Parallel Programming, Chapter 3.
- The EREW PRAM Prefix Sums algorithm is described e.g. in
the Hillis/Steele paper on
Wilkinson/Allen Parallel Programming 2e, Section 6.2.1.
More about parallel prefix algorithms can be found e.g. in the book by JaJa, or in Chapter 2 of Jordan and Alaghband Fundamentals of Parallel Processing.
On parallel sorting (Lecture 7)
- Chapter 10 of Wilkinson/Allen Parallel Programming 2e contains a presentation of several parallel sorting algorithms.
Page responsible: Christoph W Kessler
Last updated: 2018-08-31