Systems design for site reliability engineers
How to build a reliable system in three hours

Distributed systems underpin most modern computing infrastructure and much of today's application development, whether on-premises or mobile. Software built on distributed systems has distinct failure modes, and to build reliable systems you must understand how to assess those failure modes and design for them.
In this hands-on three-hour course, Salim Virji walks you through the fundamentals of systems design and evaluation, helping you build the skills necessary to design, improve, and scale your own system or application using SRE best practices developed at Google.
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- How to design a software system to meet a service-level objective (SLO)
- How to incrementally improve a system
- How to identify single points of failure (SPOFs) in a large software system
And you’ll be able to:
- Make required resource estimates to create a bill of materials
- Incrementally scale a system
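As a rough illustration of what a resource estimate feeding a bill of materials might look like, the sketch below sizes servers and disks for an image-serving workload. Every number and name in it (the peak request rate, per-server throughput, stored volume, disk size, replica count, and the helper functions `servers_needed` and `disks_needed`) is a hypothetical assumption chosen for the example, not a figure from the course.

```python
import math

def servers_needed(peak_qps: float, qps_per_server: float, spare: int = 1) -> int:
    """Servers required to carry peak load, plus spare machines for failures."""
    return math.ceil(peak_qps / qps_per_server) + spare

def disks_needed(total_bytes: int, bytes_per_disk: int, replicas: int = 1) -> int:
    """Disks required to hold the data set at the given replication factor."""
    return math.ceil(total_bytes * replicas / bytes_per_disk)

# Hypothetical inputs, for illustration only.
peak_qps = 10_000              # assumed peak requests per second
qps_per_server = 800           # assumed throughput of one server
stored_bytes = 50 * 10**12     # assumed 50 TB of images
disk_bytes = 4 * 10**12        # assumed 4 TB per disk

bill = {
    "servers": servers_needed(peak_qps, qps_per_server, spare=2),
    "disks": disks_needed(stored_bytes, disk_bytes, replicas=3),
}
print(bill)  # {'servers': 15, 'disks': 38}
```

Rounding up and adding spares keeps the estimate conservative; changing any single input shows how the bill of materials moves as the system scales.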
This training course is for you because...
- You’re a site reliability engineer (SRE) or work in a related discipline, such as DevOps, systems engineering, or system administration.
- You manage SREs.
- You want to develop an understanding of practical distributed systems.
Prerequisites
- Familiarity with “box and arrows” diagrams
- A working knowledge of orders-of-magnitude math (e.g., How many copies of a 1 MB file can a 1 TB drive hold?)
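The example question in the prerequisite above can be worked through in a few lines. This is a minimal sketch that assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and ignores filesystem overhead; the function name `copies_that_fit` is made up for the illustration.

```python
# Orders-of-magnitude estimate: how many 1 MB files fit on a 1 TB drive?
# Assumption: decimal (SI) units and no filesystem overhead.
MB = 10**6    # bytes in one megabyte
TB = 10**12   # bytes in one terabyte

def copies_that_fit(file_bytes: int, drive_bytes: int) -> int:
    """Return how many whole copies of a file fit on a drive."""
    return drive_bytes // file_bytes

print(copies_that_fit(1 * MB, 1 * TB))  # 1000000, i.e. about a million
```

The point of this kind of math is the exponent rather than the exact figure: 10^12 / 10^6 = 10^6, so the answer is on the order of a million copies.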
Recommended preparation:
- Read "Introducing Non-abstract Large System Design" (chapter 12 in The Site Reliability Workbook)
- Read "Service Level Objectives" (chapter 4 in Site Reliability Engineering)
Recommended follow-up:
- Read Distributed Systems Observability (report)
- Watch Apache ZooKeeper and The Art of Building Distributed Systems (webcast recording)
- Watch The Distributed Systems Video Collection (videos)
About your instructor
Salim Virji is a site reliability engineer at Google, where he has built distributed systems that enable planet-scale storage and datacenter-size compute loads.
Schedule
The timeframes are only estimates and may vary depending on how the class is progressing.
Identify the problem (50 minutes)
- Lecture: Problem statement (we're building an image-serving application); terminology and concepts; service-level objectives
- Hands-on exercise: Design a distributed system
- Q&A
- Break (10 minutes)
The solution has limitations. Let’s improve it (50 minutes)
- Lecture: How to quantitatively assess the failure domains in a distributed system; how to provide defense in depth so that failures are isolated
- Group discussion: Where are the failure domains?
- Hands-on exercises: Identify failure domains; make the design tolerant to failure; make a highly available image-serving system
- Q&A
- Break (10 minutes)
Commonly encountered limitations and how to design for them (50 minutes)
- Lecture: Capacity limitations, bottlenecks, and compromises; the boundaries of a system; how to decide when further scale is important
- Group discussion: Designing for 10x scale (and why this is a good rule of thumb)
Wrap-up and Q&A (10 minutes)