Chapter 1. Site Reliability Engineering in Six Words

Alex Hidalgo

When someone I’ve just met asks me what I do for a living, I generally fall back to something along the lines of, “I’m a site reliability engineer. We keep large-scale computer services reliable.” For many people, this is sufficiently boring and our general pleasantries continue. Occasionally, though, I run into people who are a bit more curious than that: “Oh, that sounds interesting! How do you do that?”

That’s a difficult question to answer! What do SREs actually do? For many years, I’d rely on just listing an assortment of things—some of which have made their way into essays in this very book. Although an answer like that wasn’t exactly wrong, it also never felt truly satisfying. There had to be a more cohesive answer, and when I reflect on my decade of performing this job, I think I’ve finally figured it out. Virtually everything SREs do relies on our ability to do six things: measure, analyze, decide, act, reflect, and repeat.

Measuring does not just mean collecting data. To measure something, you have some sort of goal in mind. You don’t collect flour to bake a cake, you measure the flour; otherwise, things will end up a mess. SREs need to measure things because pure data isn’t enough. Our data needs to be meaningful. We need to be able to answer the question, “Is this service doing what its users need it to be doing?”

Once you have measurements, the next step is to analyze them. This is when some basic statistics and probability analysis can be helpful. Learn as much as you can from the things you are measuring by using the centuries of study and knowledge mathematicians have made available to us.
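As a rough illustration of the measure and analyze steps, here is a minimal sketch in Python. The request samples, the function names, and the choice of success ratio and latency percentiles as measurements are all assumptions made for the example; a real service would pick measurements based on what its own users need.

```python
from statistics import quantiles

# Hypothetical request samples: (latency in milliseconds, succeeded?)
requests = [
    (120, True), (95, True), (310, True), (101, True), (2500, False),
    (88, True), (143, True), (99, True), (105, True), (990, True),
]

def success_ratio(samples):
    """Fraction of requests that succeeded."""
    return sum(1 for _, ok in samples if ok) / len(samples)

def latency_percentile(samples, pct):
    """Approximate latency percentile for pct in 1..99 (inclusive method)."""
    latencies = [ms for ms, _ in samples]
    # quantiles() with n=100 returns 99 cut points; index pct - 1 is the pct-th.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return cuts[pct - 1]

ratio = success_ratio(requests)
p50 = latency_percentile(requests, 50)
p90 = latency_percentile(requests, 90)

print(f"success ratio: {ratio:.2f}")
print(f"p50 latency: {p50:.1f} ms, p90 latency: {p90:.1f} ms")
```

Looking at percentiles rather than a single average is one way to see whether a few slow requests are hiding behind an otherwise healthy-looking number.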

Now you’ve done your best at measuring and analyzing how a certain thing is behaving. Use this analysis to make a decision about how best to move into the future!

Then you must act. You need to do the thing you decided to do. Sometimes the right action is to take no action at all!

Finally, reflect on what you did once you’ve done it. Place a critical, but blameless, eye squarely on whatever you’ve done. You can generally learn much more from this process than you can from your initial measurement and analysis.

Now you start over. Something has either changed about the world due to your decision or it hasn’t, and you need to keep measuring to see what the real impact of this action, or inaction, actually was. Keep measuring and then analyze, decide, act, reflect, and repeat again and again. It’s the SRE way. Incremental progress is the only reliable way to reliability.

Site reliability engineering is a broad discipline. We are often called on to be software engineers, system administrators, network engineers, systems architects, and even educators or consultants, but one paradigm that flows through all of those roles is that SRE is data-driven. Measure the things you need to measure, analyze the data you collect, decide what to do with this analysis, act on your findings, reflect on your decision, and then do it all over, again and again and again.

Measure, analyze, decide, act, reflect, and repeat: that’s site reliability engineering in six words.
