Building Resilient Distributed Systems

Book description

Struggling with system failures despite multiple safeguards? Distributed systems have become essential to modern software, yet they remain complex. If you're tired of unexpected downtime and persistent vulnerabilities, you're not alone.

In this guide, author Sam Newman offers more than just technical solutions. By combining insights from human factors and system safety with proven stability patterns, he lays out a clear path to improving resilience. Whether you're refining an existing system, building new microservices, or simply trying to understand what resilience means, this book can help. By the end, you'll be able to:

  • Grasp the true essence of resilience by integrating safety and human factors
  • Discover architectural patterns that ensure system stability
  • Translate resilience theories into practical, actionable strategies
  • Embrace the necessary cultural and behavioral shifts to support a resilient system

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. Preface
    1. Who This Book Is For
    2. What You Will Learn
    3. Navigating The Book
      1. Part 1: Technical
      2. Part 2: People, Process, and Culture
  3. 1. What Is Resiliency?
    1. What Is Resilience?
      1. Technology
      2. Social
    2. Why Resilience Matters
    3. What Is A Distributed System?
      1. How Distributed Systems Can Fail (Us)
      2. Two Golden Rules Of Distributed Systems
    4. The Human Factor
    5. The Sociotechnical System
      1. The Hexagonal Model For Sociotechnical Systems
    6. The Four Concepts Of Resilience
      1. Robustness
      2. Rebound
      3. Graceful Extensibility
      4. Sustained Adaptability
    7. How Resilient Do You Need To Be?
    8. Summary
  4. 2. Timeouts
    1. The Problem With Time
    2. Why Timeout At All?
    3. Finding The Sweet Spot
    4. Analyzing Existing Response Times
      1. Tail Latencies
      2. Tight Latency Bounds
    5. How Many Requests Can You Handle?
    6. Using User Expectations
    7. Fault Injection and Testing Timeouts
    8. Timeouts and Call Chains
    9. Timeout Propagation
      1. Clock Skew And Timeouts
    10. Other Considerations
    11. Case Study: AdvertCorp
      1. Overload
      2. Customer-driven Denial Of Service
      3. Page Load Times
      4. Missing Timeouts
      5. A Confluence Of Events
    12. Conclusion
  5. 3. Retries and Idempotency
    1. Should You Always Retry?
      1. A Better Type Of Error
    2. How Many Retries Are Appropriate?
      1. Fixed Retry Count
      2. Dynamic Retry Limit
    3. Delays Between Retries
      1. Exponential Back-off
      2. Jitter
    4. Is Retrying Safe?
    5. Idempotency
      1. Making Operations Idempotent
      2. Real World Examples
    6. Conclusion
    7. Further Reading
  6. 4. Rate Limiting
    1. Ways To Handle Having Too Much Work
      1. Just Fall Over
      2. Throw Away Some Of The Work (Load Shedding)
      3. Reduce The Work Being Sent (Back Pressure)
      4. Queue Up The Work
      5. Dynamically Provision More Resources
    2. Load Shedding
      1. Triggering Load Shedding
      2. Communicating Load Shedding To Clients
      3. Is All Work Equal?
    3. Back Pressure
      1. Client-Only Back Pressure
      2. Accord-based Back Pressure
    4. Circuit Breakers
      1. Implementation Overview
      2. Case Study: AdvertCorp
      3. For Client-only or Accord-based Back Pressure
      4. Issues With Circuit Breakers
    5. Reducing vs Stopping Traffic
      1. Boom & Bust Cycle
      2. A Delicate Balance
      3. Leaky And Token Bucket Rate Limiting
      4. Conclusion

Product information

  • Title: Building Resilient Distributed Systems
  • Author(s): Sam Newman
  • Release date: November 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098163549