Chapter 2. Practicing Incident Response Readiness (Preparedness)
We have talked about the stages of managing an incident and the incident management lifecycle. Now let’s discuss how to practice incident management so that you are ready when a real incident strikes.
Disaster Role-Playing and Incident Response Exercises
Testing and practicing incident response readiness increases resilience. We recommend implementing disaster role-playing in your team to train for incident response. At Google, we often refer to this as Wheel of Misfortune.1 One way to run such an exercise is to re-create scenarios from real production incidents you have encountered in the past.
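As a rough illustration of drawing an exercise from past incidents, the following Python sketch picks one scenario at random from a small catalog. It is not how Google runs Wheel of Misfortune; the `Scenario` type, the `spin_wheel` function, and the example incidents are all hypothetical names invented for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A past incident turned into a training scenario (hypothetical structure)."""
    title: str
    summary: str


def spin_wheel(scenarios, rng=None):
    """Return one scenario chosen uniformly at random from the catalog."""
    if not scenarios:
        raise ValueError("need at least one scenario to spin the wheel")
    if rng is None:
        rng = random.Random()
    return rng.choice(scenarios)


# Hypothetical examples; replace them with incidents from your own postmortems.
PAST_INCIDENTS = [
    Scenario("Expired TLS certificate", "Frontend requests fail after a certificate lapses."),
    Scenario("Bad config push", "A configuration change takes down one region."),
    Scenario("Full disk on database host", "Writes start failing as storage runs out."),
]

if __name__ == "__main__":
    chosen = spin_wheel(PAST_INCIDENTS)
    print(f"Today's scenario: {chosen.title}")
    print(chosen.summary)
```

In a real exercise, the facilitator would read the chosen scenario to the team member playing the on-call role and walk through the response step by step. Passing a seeded `random.Random` makes the draw repeatable if you want to plan a session in advance.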
There are tangible benefits to running regular incident response exercises. In the earlier days of Google’s Disaster Resilience Testing (DiRT) program, some tests were deemed too risky to run. Over the years, by focusing on the areas exposed by those too-risky-to-run tests, teams have addressed many of these risks so thoroughly that the tests are now automated and considered uninteresting.
Getting to that point wasn’t immediate or painless—it took time and a lot of effort from several teams to get there—but we’ve been able to reduce significant risks in the global system to “just another automated test that runs periodically.”2
Regular Testing
There are tangible benefits to regular testing. For years, Google has been running DiRT tests to find and remediate problems with our production systems. As teams ...