Carnegie Mellon University

Electrical and Computer Engineering

College of Engineering

Course Information

18-749: Building Reliable Distributed Systems

The course provides an in-depth, hands-on overview of designing and developing reliable distributed systems across a system's lifecycle, from fault-tolerant design and execution (replication, group communication, databases) to fault recovery (fault detection, logging, checkpointing, failure diagnosis) for various classes of faults (crashes, communication errors, software upgrades). The course will cover real-world practices for reliability, supplemented by case studies of large-scale downtime incidents. The concepts will be taught in the context of contemporary cloud-computing platforms, and the course will include a hands-on project that involves the design, implementation, and empirical evaluation of a reliable, distributed, cloud-based system. By the end of the semester, students will be taught to write, review, and present a conference-style research paper that documents the design, lessons learned, and experimental results of their team project. Students can expect to learn about the reliability issues underlying cloud computing, the tools and best practices for implementing and evaluating reliability, and the strengths and weaknesses of current cloud-computing platforms from the perspective of reliability.

Prerequisites: Graduate standing or instructor permission

Last Modified: 2023-07-26 2:54PM

Semesters offered:

  • Fall 2023
  • Fall 2022
  • Fall 2021
  • Fall 2020
  • Fall 2019
  • Fall 2018
  • Fall 2017
  • Fall 2016
  • Fall 2015
  • Fall 2010
  • Spring 2006
  • Spring 2005
  • Spring 2004
  • Spring 2003
  • Spring 2002
  • Fall 1999
  • Fall 1998
  • Spring 1997