Chapter 55. Reliable Systems Don’t Happen by Accident

Zach Thomas

When we’re designing intricate systems, beginning with the happy path can be a helpful simplification, but it’s a big mistake to design only for the happy path. That’s true of any computer program, and the problems multiply when we interconnect things in the cloud.

Here’s a partial list of things that go wrong all the time:

  • Something you want to reach over the network is unreachable.

  • Something you want to reach over the network is unusually slow.

  • Demand for your service suddenly overwhelms its capacity.

  • Users create data payloads orders of magnitude larger than you expected.

  • Your API requests are being throttled by your platform.

Among other implications, the cloud era means that operational concerns have become development concerns. Guarding against the unhappy path will make the difference between a reliable system and a smoking wreck.

Any part of your system that is without limits is a part that can bring down your system. This applies to everything from the inputs you accept to the amount of time you wait for a response from a downstream system. Enforce cardinalities. Do you expect your customers to create thousands of entries in your content management system? Then don’t make it possible for them to create billions. Another place to enforce limits is at ...
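The two limits named so far, a cap on how many entries a customer can create and a bound on how long you wait for a downstream system, might be sketched in Python roughly as follows. Every name here (ContentStore, EntryLimitExceeded, call_with_timeout, and both constants) is hypothetical and chosen only for illustration, and the numeric values are assumptions rather than recommendations.

```python
import concurrent.futures

# Hypothetical limits; pick values that match what you actually expect.
MAX_ENTRIES_PER_CUSTOMER = 10_000
DOWNSTREAM_TIMEOUT_SECONDS = 2.0


class EntryLimitExceeded(Exception):
    """Raised when a customer tries to exceed the entry cap."""


class ContentStore:
    """Toy in-memory store that enforces a per-customer cardinality limit."""

    def __init__(self, max_entries=MAX_ENTRIES_PER_CUSTOMER):
        self._max_entries = max_entries
        self._entries = {}  # customer_id -> list of entries

    def add_entry(self, customer_id, entry):
        entries = self._entries.setdefault(customer_id, [])
        # Reject the write up front instead of letting the data grow unbounded.
        if len(entries) >= self._max_entries:
            raise EntryLimitExceeded(
                f"customer {customer_id!r} already has {len(entries)} entries"
            )
        entries.append(entry)


def call_with_timeout(fn, timeout_seconds, *args):
    """Run fn(*args) in a worker thread and wait at most timeout_seconds.

    Raises concurrent.futures.TimeoutError if the call takes longer.
    The caller stops waiting, but the worker thread is not cancelled.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_seconds)
    finally:
        # Do not block on the slow call while cleaning up.
        pool.shutdown(wait=False)
```

A caller would wrap a downstream request as call_with_timeout(fetch_profile, DOWNSTREAM_TIMEOUT_SECONDS, user_id), where fetch_profile stands in for whatever client function you use. When the client library offers its own timeout parameter, using that is usually preferable, because it can release the connection rather than only abandoning the wait.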

