TDDD56 Multicore and GPU Programming

Literature

Reading directions per lecture topic can be found here.

Books

Since the 2018 edition of the course, we have two new course compendia that cover some (but not all) of the lectures of the course. These are made available to registered course participants via the course homepage.

C. Kessler: Design and Analysis of Parallel Algorithms: An Introduction. Compendium, version Sep. 2024, (c) 2020-2024.
Available for registered participants of TDDD56. Not for general distribution.
Covers the TDDD56 lectures on: multicore architecture concepts, models of parallel computing, design and analysis of parallel algorithms, parallel sorting, and automatic (compiler) optimizations and parallelization.
I. Ragnemalm: Attack in Packs. Compendium, (c) 2018.
Available for registered participants of TDDD56. Not for general distribution.
Covers the TDDD56 GPU lectures.

So far, there is no single textbook (yet) that covers the entire course contents.
In principle you should be able to follow the course based on the OH material, the two course compendia, and the online references (e.g., articles) given further below.
If you prefer to work with a textbook, we recommend that you choose one general introductory text to parallel programming (not needed if you already have taken TDDE65/TDDC78 earlier), such as Wilkinson/Allen or Lin/Snyder, and possibly complement this with a book or online material about GPU programming. If you want to work with a GPU programming book, too, we recommend, in the first hand, Sanders/Kandrot CUDA by example, see below. We also give a selection of reference books for secondary reading below.
All books listed below should be available in the LiU Campus Valla library, some also as reference copy or e-books.

C. Lin, L. Snyder: Principles of Parallel Programming. Pearson/Addison Wesley, 2008. 978-0-321-54942.
(General introduction text. The Campus-Valla library has a copy for loan and will also have a reference copy or e-book. There is an Errata available for the first printing.)
T. Mattson, B. Sanders, B. Massingill: Patterns for Parallel Programming. Addison Wesley, 2005. 978-0-321-22811-6
(More a survey, could be useful as a secondary book. The Campus-Valla library has a copy for loan.)
M. Herlihy and N. Shavit: The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, 2008. 978-0-12-370591-4
(More advanced issues in shared-memory synchronization, esp. about non-blocking concurrent data structures. The Campus-Valla library has one copy for loan and also as e-book.)
B. Wilkinson, M. Allen: Parallel Programming, Second Edition. Pearson/Prentice Hall, 2005. 0-13-140563-2
(General introduction text. The course book in TDDC78. The Campus-Valla library has one reference copy and one copy for loan.)
J. Sanders, E. Kandrot: CUDA by example. Addison-Wesley, 2011. 978-0-13-138768-3
(Recommended further reading, for GPU programming in CUDA. The Campus-Valla library has one copy for loan.)
David B. Kirk and Wen-mei W. Hwu: Programming Massively Parallel Processors: A Hands-on Approach.
Morgan Kaufmann, 2010 (first edition), 2012 (second edition). ISBN 0123814723
(Only secondary reading, for GPU programming in CUDA. Available in the Campus-Valla library as e-book.)
A. Grama, G. Karypis, V. Kumar, A. Gupta: Introduction to Parallel Computing, 2nd Edition. Addison-Wesley, 2003.
(General introduction text with focus on parallel algorithms. The Campus-Valla library has a copy for loan.)
A. Munshi, B. Gaster, T. Mattsson, J. Fung, D. Ginsburg: OpenCL Programming Guide. Addison-Wesley, 2011. Print-edition 978-0-321-74964-2
(Background reading on OpenCL. The Campus-Valla library has acquired one copy for loan.)
B. Gaster, L. Howes, D. Kaeli, P. Mistry Heterogeneous Computing with OpenCL. Morgan Kaufmann Publishers, 2011, 978-0123877666
(Background reading on OpenCL. The Campus-Valla library will acquire one copy for loan.)

On-line material and references

On shared-memory parallel architecture concepts:

J. Hennessy, D. Patterson: Computer Architecture, A Quantitative Approach, 4th ed.. Morgan Kaufmann, 2007.
The second edition (1996), third edition (2002) and fourth edition (2007) each contain a chapter on multiprocessor architecture. The current fifth edition should also be fine. The Campus-Valla library has copies.
As further reading:
D. Culler, J. Singh, A. Gupta: Parallel Computer Architecture. Morgan Kaufmann, 1998. Contains an in-depth treatment of SMP and CC-NUMA shared-memory architectures.

On non-blocking synchronization:

Adam Morrison: Scaling Synchronization in Multicore Programs. Communications of the ACM 59(11), Nov. 2016.
Maged M. Michael: The Balancing Act of Choosing Nonblocking Features. Communications of the ACM 56(9), Sep. 2013.

On the design and analysis of parallel algorithms:

W. Hillis, G. Steele: Data parallel algorithms. Comm. ACM 29(12), Dec. 1986
C. Kessler, J. Keller: Models for parallel computing: Review and Perspectives. PARS-Mitteilungen 24: 13-29, ISSN 0177-0454, Dec. 2007. Gesellschaft für Informatik e.V., Germany.
J. Keller, C. Kessler, J. Träff:
Practical PRAM Programming. Wiley Interscience, 2001.
Chapter 2 gives a summary of the PRAM model and introduces basic theory of parallel processing (Time, work, cost, Amdahl's law, Brent's theorem, fundamental parallel algorithms). The book is available in the Campus-Valla library.
M. Hill, M. Marty: Amdahl's Law in the Multicore Era. IEEE Computer 41(7): 33-37, July 2008.

On GPU architecture and programming:

M. Garland, D. Kirk: Understanding Throughput-Oriented Architectures. Communications of the ACM 53(11), Nov. 2010.
J. A. Stratton et al.: Algorithm and data optimization techniques for scaling to massively threaded systems. IEEE Computer 44(8), Aug. 2012, pp. 26-32.
J. Lebar (Google): Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler (Presentation at Cppcon 2016, Youtube). Contains an introduction to GPGPU architecture and programming from a CPU perspective, which can be of interest for TDDD56 as extra material. The presentation also explains how CUDA compilation works.
M. Zahran: Heterogeneous Computing is Here to Stay. Communications of the ACM 60(3):42-45, 2017.

On programming and optimization techniques for multicore CPU architectures:

N. Satish et al.: Can traditional programming bridge the Ninja performance gap for parallel computing applications? Comm. of the ACM 58(5), May 2015, pp. 77-86.

This list may be extended during the course.

Reading Directions per Lecture/Topic

The following list gives directions to background reading especially where the information given on the slides should not suffice.
Bibliographical details of the referenced textbooks can be found in the literature list above.

Background knowledge (Prerequisites)
Not all participants of this course have a core computer science background, hence some technical terms might not be well known to everyone even if the topic is listed in the course prerequisites in the syllabus. Nevertheless, here are some links to catch up before / during the first days of the course.

Big-O notation is covered in Appendix Chapter A of C. Kessler's compendium.
See also this Slide set on the analysis of sequential algorithms and Big-O notation from an earlier course in data structures and algorithms.
Central terms about concurrent programming (processes, threads, race conditions, synchronization, mutual exclusion, locks, deadlocks, ...) are shortly recapitulated and explained in the slide set of the third lecture, even though there is no time to go through all this in detail again in TDDD56. If you need more background reading about these concepts, read e.g. Chapters 3, 4 and 6 of (any recent edition of) Galvin, Silberschatz, Gagne: Operating System Concepts, or any other good textbook on operating systems concepts.
If there are any further items that are new to you but not to most others, please contact me in the lecture break.

On multicore architecture concepts and trends (Lecture 1)

Appendix-Chapter B in C. Kessler's compendium (new 2024 edition)
Chapter 2 of Lin/Snyder (until p. 53) provides some introductory material about multicore architecture concepts. (There is nothing about multicore and accelerator hardware concepts in the Wilkinson/Allen book.)
For a more in-depth review of general concepts in (parallel) computer architecture, such as VLIW, hardware multithreading, multicore, caches, (shared) memory etc., see e.g. Hennessy, Patterson: Computer Architecture, a quantitative approach, 4th edition 2006, 5th edition 2011, or later, or other modern textbooks about computer architecture.

On shared memory architecture, cache coherence, false sharing etc. (Lecture 2)

Appendix-Chapter B in C. Kessler's compendium (new 2024 edition)
The MSI and MESI coherence protocols (bus snooping) are described e.g. in Section 5.3 of the book by Culler, Pal Singh and Gupta, Parallel Computer Architecture, 1998.
Consistency models are summarized e.g. in Wilkinson/Allen Parallel Programming 2e, Section 9.3.

On programming with threads and tasks (Lecture 3)

A survey of shared memory programming (processes, threads, critical sections, locks, thread pools, OpenMP) is given e.g. in Wilkinson/Allen Parallel Programming 2e, Chapter 8, or in Chapter 6 of Lin/Snyder Principles of Parallel Programming.
Dynamic load balancing is described e.g. in Wilkinson/Allen Parallel Programming 2e, Section 7.2.
Programming with light-weight tasks (a la Cilk and Futures) and dynamic scheduling by work-stealing is shortly presented e.g. in Herlihy/Shavit The Art of Multiprocessor Programming, Chapter 16.

On non-blocking synchronization (Lecture 4)

Chapters 9, 10 and 11 of Herlihy/Shavit The Art of Multiprocessor Programming give more information about lock-free data structures (linked lists, stacks, queues), non-blocking synchronization, and the ABA problem.

On design and analysis of parallel algorithms (Lecture 5-6)

This material is completely covered in C. Kessler's compendium.
Most of the theory, including a proof of Brent's Theorem, is presented e.g. in Chapter 2 of Keller, Kessler, Träff: Practical PRAM Programming, Wiley 2001.
Speedup etc. is introduced e.g. in Wilkinson/Allen Parallel Programming 2e, Section 1.1, or in Lin/Snyder Principles of Parallel Programming, Chapter 3.
The EREW PRAM Prefix Sums algorithm is described e.g. in the Hillis/Steele paper on Data-parallel algorithms or in Wilkinson/Allen Parallel Programming 2e, Section 6.2.1.
More about parallel prefix algorithms can be found e.g. in the book by JaJa, or in Chapter 2 of Jordan and Alaghband Fundamentals of Parallel Processing.
If you need to recap on big-O notation for asymptotic growth of functions, check Appendix A in C. Kessler's compendium.

On parallel sorting (Lecture 7)

This material is completely covered in C. Kessler's compendium.
Chapter 10 of Wilkinson/Allen Parallel Programming 2e contains a presentation of several parallel sorting algorithms.

On parallel patterns and skeleton programming (Lecture 8)

Parallel patterns are partly covered in Chapter 1 of C. Kessler's compendium.
Information about programming in SkePU can be found on the SkePU web page.

On dependence analysis, loop transformations and parallelization (Lecture 13)

Chapter 5 in C. Kessler's compendium (new 2024 edition)
Beyond that, we refer for further details to a standard textbook such as: R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

Page responsible: Christoph W Kessler
Last updated: 2024-10-30

IDA - Department of Computer and Information Science

TDDD56 Multicore and GPU Programming

Literature

Books

On-line material and references

Reading Directions per Lecture/Topic