Chapter 5. Google DiRT: Disaster Recovery Testing

“Hope is not a strategy.” This is the motto of Google’s Site Reliability Engineering (SRE) team, and it perfectly embodies the core philosophy of Chaos Engineering. A system may be engineered to tolerate failure, but until you explicitly test failure conditions at scale, there is always a risk that expectation and reality will not line up. Google’s DiRT (Disaster Recovery Testing) program was founded by site reliability engineers (SREs) in 2006 to intentionally instigate failures in critical technology systems and business processes in order to expose unaccounted-for risks. The engineers who championed the DiRT program made the key observation that analyzing emergencies in production becomes a whole lot easier when it is not actually an emergency.

Disaster testing helps prove a system’s resilience when failures are handled gracefully, and exposes reliability risks in a controlled fashion when things are less than graceful. Exposing reliability risks during a controlled incident allows for thorough analysis and preemptive mitigation, as opposed to waiting for problems to expose themselves by means of circumstance alone, when issue severity and time pressure amplify missteps and force risky decisions based on incomplete information.

DiRT began with Google engineers performing role-playing exercises similar to the Game Days practiced at other companies. They particularly focused on how catastrophes and natural disasters ...
