Distributed Algorithms for Fault ToleranceDF15500, 2012VT
8h-10h lectures, combined with some discussion seminars (tutorials) based on student questions.
The course was last given
To get an insight into the history of development and importance of fault tolerance algorithms for distributed systems. To understand what are the major classifications, concepts and terms which define classes of distributed systems, fault tolerance problems therein and solutions to some of these problems: Elements of robust algorithms, in particular concensus and broadcast algorithms, group mechanisms and stablising algorithms. To understand the underlying failure models for which strong results are available. To understand major limitations of achieving fault tolerance with the help of robust algorithms in an asynchronous setting. To study a well-known problem that is solvable within a synchronous setting: Byzantine agreement. To get an overview of stablising algorithms and to study examples of such algorithms.
Students who have satisfied the learning goals of core CUGS courses.
Introduction to the course, basic notions in fault tolerance (FT) and a short review of the area in the intersection of FT and formal analysis of systems. Introduction to basic notions in distributed systems, including local/global state, cut and mechanisms for broadcast, overview on replication models. Consensus, and the related problems, the main impossibility results. Unreliable failure detectors. Group communication, view synchronous broadcast. Partitions and group management. Self-stablising algorithms.
Lectures and tutorial sessions
Introduction to distributed algorithms, Gerard Tel, Cambridge press. and research papers.
Simin Nadjm-Tehrani, Mikael Asplund or Ulf Nilsson (tentative).
Home individual exercises.
(will be only given if at least 6 students are interested)
Page responsible: Director of Graduate Studies
Last updated: 2012-05-03