Chapter 8. Reliability

Everything fails, all the time.

Werner Vogels, CTO of Amazon Web Services

In software systems, there’s always something that can go wrong. Perhaps a database connection pool fills up, perhaps there’s unexpected latency due to network issues, or maybe there’s a bad deployment and a service starts responding with 500s. The likelihood of encountering failure is even higher in microservices infrastructures because a single request can involve multiple services. For example, every service may have 99.9% uptime, but if five services are involved in each request and they fail independently, your uptime as a whole will only be about 99.5% (0.999^5 ≈ 0.995). That might not seem like a huge difference, but it’s actually an increase from just under 9 hours of downtime per year to almost 44!
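The arithmetic behind that example can be checked with a short sketch. It assumes that the five services fail independently, so their availabilities multiply, and that a year has 8,760 hours; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def composite_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every service to be up.

    Assumption: failures are independent, so availabilities multiply.
    """
    return per_service ** services


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours per year during which requests fail."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.999
chained = composite_availability(single, 5)

print(f"one service:   {single:.3%} up, {downtime_hours_per_year(single):.1f} h/year down")
print(f"five services: {chained:.3%} up, {downtime_hours_per_year(chained):.1f} h/year down")
```

Under these assumptions, a single service is down for about 8.8 hours per year and the five-service request path for about 43.7 hours, in line with the figures above. Correlated failures, such as a shared network outage, would change the result.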

Outright preventing failure across all your systems is simply not possible. There are too many components involved, the complexity is too high, and there is only so much you can invest in reliability without taking time away from user-facing features and other business needs. Since failure is inevitable, the best you can do is engineer your systems to handle failure gracefully, which means reducing the impact of failure as much as possible.

Consul can help reduce the impact of failure via its sidecar proxies. This chapter looks at three techniques: health checking, retries, and timeouts.
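To make the ideas of retries and timeouts concrete, here is a minimal sketch written as application code. It is not Consul configuration and does not use any Consul API; `call_with_retries` and `make_flaky` are hypothetical helpers, and the callable is assumed to accept a per-attempt timeout in seconds and to raise `TimeoutError` or `ConnectionError` on failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[float], T],
    attempts: int = 3,
    timeout_s: float = 1.0,
    backoff_s: float = 0.1,
) -> T:
    """Invoke call(timeout_s), retrying on timeouts and connection errors."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # The timeout bounds how long one attempt may take.
            return call(timeout_s)
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Wait a little longer before each new attempt.
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


def make_flaky(failures: int) -> Callable[[float], str]:
    """Hypothetical upstream that times out `failures` times, then succeeds."""
    state = {"remaining": failures}

    def flaky(timeout_s: float) -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TimeoutError(f"no response within {timeout_s}s")
        return "200 OK"

    return flaky


result = call_with_retries(make_flaky(2), attempts=3, backoff_s=0.0)
print(result)
```

Here the first two attempts time out and the third succeeds, so the caller never sees the transient failure. Handling this in a sidecar proxy instead keeps the same logic out of every service's code.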

Health checking detects service failure and bypasses those services by routing to other healthy instances. There are two ...
