Chapter 10. Monitoring

An unreliable person is nobody’s friend.

Idries Shah, Reflections

There was once a product called Chuck that was so well built it naturally had 99.9999% availability. (Yup, he was practically always ready to handle any requests!)

Chuck lived a peaceful life free of downtime and outages in Production Land. One ordinary day, much like any other, he was minding his own business and strolling down Production Avenue when he suddenly felt a sharp loss of connectivity and had to sit down slowly on the sidewalk. Chuck thought to himself, “Is this it? Am I finally falling over?”

Was Chuck experiencing a once-very-distant-memory network outage?!?!!

Chuck in Production Land is no fairy tale; it’s the real-life story of a Google product named Chubby that was very well architected and proved to be so reliable that it led to a false sense of security among its users. They conned themselves into believing that it would never go down and so increased their dependence on it well beyond its advertised, observed, and monitored availability.

We all know that unicorns are mythical creatures that don’t exist, and so is 100% uptime for a software product. Even though Chubby rarely faced any incidents, they still occasionally happened, leading to unexpected (and, more importantly, unplanned for) disruptions in its downstream services.
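To put numbers on that, here is a minimal sketch of the arithmetic that turns an availability target into the downtime it still permits. The function name `allowed_downtime_seconds` and the 365-day year are assumptions made for illustration, not anything Chubby or Google published.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime an availability target still permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    # Whatever fraction of the period is not covered by the target is downtime.
    return period_seconds * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.9999):
    print(f"{target}% -> {allowed_downtime_seconds(target):.1f} seconds per year")
```

Even Chuck's 99.9999% leaves roughly half a minute of downtime a year. Only a 100% target leaves none, and that is the unicorn.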

For Google, the solution to this unicornish scenario was to deliberately bring its own system down often enough to match its advertised uptime, thereby ...
