Site Reliability Engineering (SRE)

Chaos engineering: Planning and running your first game day

Published by O'Reilly Media, Inc.

Content level: Intermediate

Modern systems need to be reliable, resilient, robust…and continuously changing. Under these conditions, failure is a normal state for the infrastructure, platforms, and applications that make up a production system. Chaos engineering is a disciplined approach to turning that failure to your advantage, enabling you to inject controlled, preemptive failure into your systems so that you can surface and overcome weaknesses before your customers encounter them.

Join expert Russ Miles to learn how to adopt and apply the mindset and practices of a successful chaos engineer. Through lectures, practical examples, and hands-on exercises, you'll discover how to turn system failure into opportunities for learning as you successfully plan and execute your first game day, a collaborative exercise in which you deliberately place your systems—people, practices, processes, and technology—under stress in order to explore and overcome weaknesses to improve resiliency.

What you’ll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Why you can't prove system reliability in advance
  • The purpose and limitations of chaos engineering
  • How to explain the value of chaos engineering to your company
  • The purpose of game day exercises

And you’ll be able to:

  • Plan and execute a successful game day to explore system weaknesses at the infrastructure, platform, and application levels
  • Enable appropriate system observability to support chaos engineering
  • Communicate and share the findings from a game day to enable prioritized system improvement

This live event is for you because...

  • You're a software developer who needs to start taking responsibility for your code in production.
  • You're a site reliability engineer (SRE) with a little experience managing production, and you want to be proactive about finding system weaknesses before your customers do.
  • You're a system administrator who is responsible for the availability of production, and you need a proactive technique for surfacing system weaknesses before your customers experience them.
  • You're a product owner who is responsible for delivering a business-critical product or service, and you want to learn how to gain trust and confidence in your system’s reliability.
  • You're a DevSecOps engineer who needs a technique and tools to support discovering, capturing, sharing, and collaborating on security weaknesses.

Prerequisites

  • A general understanding of Java and of Kubernetes as a platform

Materials or downloads needed in advance:

  • Visit the course website and follow the precourse instructions
  • Download the game day template (link TBD)
  • Sign up for the course Slack channel #chaosengoreilly

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

The chaos engineering mindset (20 minutes)

  • Lecture: Introduction to the sociotechnical system; the challenges of production; how the Cynefin model of systems proves we can't be sure of system reliability and resilience; trust and confidence; the chaos engineering mindset; how to distill outages into chaos engineering
  • Hands-on exercise: Use the outage template to explore and document findings from a production outage as input for chaos engineering

Attacks on reliability and resiliency (20 minutes)

  • Lecture: Why chaos engineering is a proactive approach to building trust and confidence in system reliability and resilience; continuous limited-scope disaster recovery; commoditized disaster recovery; the various levels of attack on reliability and resilience
  • Hands-on exercises: Brainstorm and share how you might “prove” your systems are resilient and reliable; explore various attacks on system reliability and resilience

Defining resilience and reliability (10 minutes)

  • Lecture: Reliability and resilience; the “premortem” and how it relates to incident postmortems

Break (10 minutes)

Introduction to game days (50 minutes)

  • Lecture: Basic game day concepts; deciding who attends your game day; whether or not to surprise participants; building a hypothesis; understanding and introducing observability; defining your method; defining remediation actions
  • Hands-on exercises: Construct a plan for assessing and improving the observability of your system; design a game day using the game day template
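
As an illustration of the planning decisions listed above, a game day plan could be captured as a small data structure. This is only a sketch in Python: the `GameDayPlan` class, its field names, the `missing_sections` helper, and the example values are all hypothetical and are not taken from the course's game day template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GameDayPlan:
    """Hypothetical container for the decisions a game day plan records."""
    hypothesis: str = ""
    attendees: List[str] = field(default_factory=list)
    surprise: bool = False  # True if participants are not told in advance
    observability: List[str] = field(default_factory=list)
    method: List[str] = field(default_factory=list)
    remediation: List[str] = field(default_factory=list)


def missing_sections(plan: GameDayPlan) -> List[str]:
    """Return the names of the sections that are still empty."""
    required = ["hypothesis", "attendees", "observability", "method", "remediation"]
    return [name for name in required if not getattr(plan, name)]


# Example values are invented for illustration only.
plan = GameDayPlan(
    hypothesis="Checkout stays available when one payment pod is terminated",
    attendees=["on-call engineer", "product owner"],
    method=["Terminate one payment pod", "Watch checkout error rate for 10 minutes"],
)
print(missing_sections(plan))  # prints ['observability', 'remediation']
```

A check like this simply flags which parts of the plan still need attention before the game day goes ahead; here the observability signals and remediation actions have not yet been defined.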

Break (10 minutes)

Learning from your game day (50 minutes)

  • Lecture: How to avoid resistance to game days; why collaboration and empathy are crucial characteristics of a chaos engineer; the ethics of chaos engineering and why they can mean the difference between success and failure in your organization; the 24-hour rule on ideas for solutions to discovered weaknesses; the limitations of game days and how they relate to automated chaos experiments; the power of measuring resiliency through mean time to detect, mean time to diagnose, mean time to recovery, and mean time to all clear; why you should be careful not to rely too much on these statistics
  • Hands-on exercises: Take the findings from a real-world game day and convert them into a plan for system improvement, along with appropriate metrics; work through the findings from multiple game days to build a roadmap for system improvement; identify a list of candidates for further exploration through continuous chaos and automation
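
The four resiliency measures named above can be computed from timestamps recorded during game days. The following is a minimal Python sketch under stated assumptions: the `Incident` record and `resilience_metrics` function are hypothetical, and the choice of intervals (detect, recovery, and all clear measured from the start of the failure; diagnose measured from detection) is one possible convention, not a definition given by the course. The sample timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical record of the key moments in one game day failure."""
    started: datetime
    detected: datetime
    diagnosed: datetime
    recovered: datetime
    all_clear: datetime


def _mean_minutes(deltas) -> float:
    """Average a list of timedeltas and express the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def resilience_metrics(incidents: List[Incident]) -> Dict[str, float]:
    """Compute the four means, in minutes, under the assumed conventions."""
    return {
        "mean_time_to_detect": _mean_minutes([i.detected - i.started for i in incidents]),
        "mean_time_to_diagnose": _mean_minutes([i.diagnosed - i.detected for i in incidents]),
        "mean_time_to_recovery": _mean_minutes([i.recovered - i.started for i in incidents]),
        "mean_time_to_all_clear": _mean_minutes([i.all_clear - i.started for i in incidents]),
    }


# Invented sample data from two game day failures.
incidents = [
    Incident(datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 10, 4),
             datetime(2024, 1, 10, 10, 20), datetime(2024, 1, 10, 10, 30),
             datetime(2024, 1, 10, 10, 45)),
    Incident(datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 14, 10),
             datetime(2024, 1, 17, 14, 20), datetime(2024, 1, 17, 14, 50),
             datetime(2024, 1, 17, 15, 0)),
]
for name, minutes in resilience_metrics(incidents).items():
    print(f"{name}: {minutes:.1f} min")
```

With only two samples the averages say little on their own, which is consistent with the caution above about relying too heavily on these statistics.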

Wrap-up and Q&A (10 minutes)

Your Instructor

  • Russ Miles

    A self-confessed polyglot programmer, Russ Miles is head of engineering at Crown Agents Bank. To ensure that he has as little spare time as possible, Russ contributes to various open source projects and has authored or coauthored a number of books, including AspectJ Cookbook, Learning UML 2.0, and Head First Software Development, all for O’Reilly. Previously, Russ gained experience in enterprise development across all tiers of application architecture, from high-performance, highly usable presentation-tier services for the search and mobile portal industries to maximum-availability application and data services for the defense industry. Russ holds an MSc in software engineering from Oxford University.
