Chapter 42. REvisiting the Rs of SRE

J. Paul Reed

One of the hottest topics in site reliability engineering right now is how to make your applications and services resilient in the face of failure. And, as most site reliability engineers know, one of the main arguments for moving those services into the cloud is its much-touted robustness; but, of course, we must think differently about how we architect our applications if we expect them to “automagically” rebound when something goes amiss in the technological sky.

Engineers frequently run into these R-words (resilient, robust, rebound) in discussions on how to develop and operate in the cloud. Hearing them so often, you might have started to wonder: don’t they all sorta…mean the same thing? Fear not: resilience engineering (RE) is here to help clarify all those Rs!

Resilience engineering has existed as a subdiscipline within the safety sciences for over two decades; practitioners recently started to apply its concepts to our industry, looking at how human factors, ergonomics, and “safety” relate to improving the functioning of the web-scale systems that developers and operations engineers wrangle with daily. A major point of examination is the contributions we messy humans make to our systems.

In resilience engineering, those R-words refer to specific (and different) aspects of the socio-technical systems within ...

Get 97 Things Every Cloud Engineer Should Know now with the O’Reilly learning platform.