Chapter 46. Move Fast to Unbreak Things
Michelle Brush
As SREs, we see our job as balancing velocity with reliability. We know each change deployed to production, whether code or configuration, carries some risk of causing an outage or other degradation of service. When an outage happens, our immediate reaction is to be more cautious, to slow down production changes.
Then things still break. Despite our caution, outages keep happening. Our instinct was wrong: things are now more likely to break precisely because we slowed down. When a plane stalls in midair, the natural reaction might be to pull up, away from the ground. The right answer is to point the nose of the plane down and increase engine power, which restores airflow over the wings and generates lift. Sometimes the right thing to do is the opposite of what our intuition tells us.
Your development organization is a faucet. It produces change (whether features, bugs, or architectural work) at a somewhat constant rate. Separately, there’s a rate at which those changes can flow into production. The production flow rate is determined by your deployment cadence, the speed of your quality assurance process, any approval requirements, and so on.
What happens when you slow that cadence, whether explicitly by freezing or implicitly through increased review, human checks, or a change approval process? You accumulate a bigger backlog of changes awaiting ...
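The faucet model can be made concrete with a little arithmetic. The following is a minimal sketch, not a model taken from the chapter: it assumes a constant rate of five changes per day and assumes every deployment ships the whole backlog at once. The function name `simulate_backlog` and all of the numbers are illustrative.

```python
def simulate_backlog(change_rate, deploy_every, days):
    """Return the size of each deployed batch over `days` days.

    change_rate  -- changes produced per day (the "faucet"), assumed constant
    deploy_every -- days between production deployments (the cadence)
    """
    backlog = 0
    batches = []
    for day in range(1, days + 1):
        backlog += change_rate        # development keeps producing change
        if day % deploy_every == 0:   # assumption: a deploy drains the whole backlog
            batches.append(backlog)
            backlog = 0
    return batches


# Same faucet, three hypothetical cadences over four weeks.
daily = simulate_backlog(change_rate=5, deploy_every=1, days=28)
weekly = simulate_backlog(change_rate=5, deploy_every=7, days=28)
four_weekly = simulate_backlog(change_rate=5, deploy_every=28, days=28)

for name, batches in [("daily", daily), ("weekly", weekly), ("every 4 weeks", four_weekly)]:
    print(f"{name:>14}: {len(batches)} deploys, {max(batches)} changes per deploy")
```

Under these assumptions the total amount of change reaching production is identical in all three cases (140 changes). Only the batch size differs: 5 changes per deploy at a daily cadence, 35 at a weekly cadence, and 140 when everything waits four weeks.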