Protocols and practice
Dealing with an incident or outage can be one of the most stressful parts of supporting any service. Outages and production incidents negatively affect the business, revenue, user happiness, and engineer cortisol levels, and the kicker is that they are unavoidable: outages will occur no matter how resilient you think your system is. There are, however, many practices you can use to limit the negative impact and reduce the time it takes to fix an outage.
In this course, you’ll learn the fundamentals of incident management so you can respond to outages more quickly, minimize the damage, and learn from each incident to prevent similar ones in the future.
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- How the production incident cycle works
- How to minimize the time to detect an incident using effective alerts, Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
- How to minimize the time to repair an incident
- How to reliably identify when to use incident-management protocols
- How to identify incident commanders, communicators, and other key roles for incident management
- How to maximize the time between outages using postmortems
And you’ll be able to:
- Bring incident management practices to your organization
- Maintain consistency across individuals and teams for incident management practices
- Incorporate incident management into healthy postmortem practices
This training course is for you because...
- You’re a site reliability engineer (SRE), or work in a related discipline: DevOps, Systems Engineering, System Administration
- You manage SREs
Prerequisites and recommended reading
- Familiarity with an on-call system
- Read Managing Incidents (Chapter 14 in Site Reliability Engineering)
- Read Site Reliability Engineering (book)
About your instructor
Cindy Quach is a Site Reliability Engineer at Google. She has worked as an SRE on various Google products and teams, including the internal Linux distribution, mobile infrastructure, and virtualization. She currently works on the customer reliability engineering team, where she helps Google Cloud customers adopt SRE practices and principles to scale their services.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Part 1: Fundamentals of Incident Management (55 minutes)
- Presentation: An overview of incident management, covering what an incident is, why we want to reduce how long an incident lasts, and how to accomplish that.
- Exercise: Identify good and not-so-good postmortem action items.
- Break (5 minutes)
Part 2: Practical Incident Management (55 minutes)
- Presentation: Hands-on incident management
- Exercise: Walk through an incident using IMAG (Incident Management at Google).
- Break (5 minutes)
Part 3: Incident Management and Beyond (30 minutes)
- Presentation: Overview of other techniques you can use to reduce mean time to detect (MTTD) and mean time to repair (MTTR), and to increase mean time between failures (MTBF).
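
To make the three metrics above concrete, here is a minimal Python sketch of how they might be computed from a list of incident records. It is an illustration rather than course material: the `Incident` class, its field names, and the `incident_metrics` function are hypothetical, and definitions vary between organizations. This sketch assumes MTTD runs from the start of the impact to detection, MTTR runs from detection to resolution, and MTBF is the gap between one incident's resolution and the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime   # when user impact began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when the impact ended


def _mean_delta(deltas):
    """Average a list of timedeltas; return None for an empty list."""
    if not deltas:
        return None
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def incident_metrics(incidents):
    """Return MTTD, MTTR, and MTBF under the assumptions stated above."""
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = _mean_delta([i.detected - i.started for i in ordered])
    mttr = _mean_delta([i.resolved - i.detected for i in ordered])
    # Time between failures: from one resolution to the next start.
    gaps = [nxt.started - prev.resolved
            for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = _mean_delta(gaps)
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


# Made-up example data: two incidents ten days apart.
history = [
    Incident(datetime(2024, 1, 1, 10, 0),
             datetime(2024, 1, 1, 10, 10),
             datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 11, 11, 0),
             datetime(2024, 1, 11, 11, 20),
             datetime(2024, 1, 11, 11, 50)),
]
metrics = incident_metrics(history)
print(metrics)
```

Tracking these values over time is one way to check whether better alerting is shortening detection, whether incident-management practice is shortening repair, and whether postmortem action items are lengthening the time between outages.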