Chapter 2. Do We Know Why We Really Want Reliability?

Niall Murphy

Do we really understand reliability, or why we would want it?

This may seem like a strange question. It is an article of faith in this community that unreachable online services have no value. But even a moment’s thought will show you that’s simply not true. You yourself encounter intermittent computer failure almost every day. Some contexts even seem to expect it; with web services, users are highly accustomed to hitting refresh or (for more difficult problems) clearing cookies, restarting a browser, or restarting a machine. Even services themselves have retry protocols.

A certain amount of fudge is baked into every human–computer interaction. Even for longer outages, people almost always come back if you’re down for a few minutes, and they show even more patience the more unique the service is.

It’s anecdotal, but suggestive: a couple of years ago I had a conversation with a very well-known company in which they said they didn’t put any money into reliability because their particular customer base had nowhere else to go. Therefore, any time they spent on reliability would be time they wouldn’t spend on capturing revenue; it wasn’t worth it.

I gasped inwardly at the time, but I’ve thought about it often since, and I turn the question toward us, as a community, now: do we have any real argument ...
