Site Reliability Engineering

Book description

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four main sections, followed by a short concluding part:

  • Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices—Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
  • Management—Explore Google's best practices for training, communication, and meetings that your organization can use

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Safari
    4. How to Contact Us
    5. Acknowledgments
  3. I. Introduction
  4. 1. Introduction
    1. The Sysadmin Approach to Service Management
    2. Google’s Approach to Service Management: Site Reliability Engineering
    3. Tenets of SRE
      1. Ensuring a Durable Focus on Engineering
      2. Pursuing Maximum Change Velocity Without Violating a Service’s SLO
      3. Monitoring
      4. Emergency Response
      5. Change Management
      6. Demand Forecasting and Capacity Planning
      7. Provisioning
      8. Efficiency and Performance
    4. The End of the Beginning
  5. 2. The Production Environment at Google, from the Viewpoint of an SRE
    1. Hardware
    2. System Software That “Organizes” the Hardware
      1. Managing Machines
      2. Storage
      3. Networking
    3. Other System Software
      1. Lock Service
      2. Monitoring and Alerting
    4. Our Software Infrastructure
    5. Our Development Environment
    6. Shakespeare: A Sample Service
      1. Life of a Request
      2. Job and Data Organization
  6. II. Principles
  7. 3. Embracing Risk
    1. Managing Risk
    2. Measuring Service Risk
    3. Risk Tolerance of Services
      1. Identifying the Risk Tolerance of Consumer Services
      2. Identifying the Risk Tolerance of Infrastructure Services
    4. Motivation for Error Budgets
      1. Forming Your Error Budget
      2. Benefits
  8. 4. Service Level Objectives
    1. Service Level Terminology
      1. Indicators
      2. Objectives
      3. Agreements
    2. Indicators in Practice
      1. What Do You and Your Users Care About?
      2. Collecting Indicators
      3. Aggregation
      4. Standardize Indicators
    3. Objectives in Practice
      1. Defining Objectives
      2. Choosing Targets
      3. Control Measures
      4. SLOs Set Expectations
    4. Agreements in Practice
  9. 5. Eliminating Toil
    1. Toil Defined
    2. Why Less Toil Is Better
    3. What Qualifies as Engineering?
    4. Is Toil Always Bad?
    5. Conclusion
  10. 6. Monitoring Distributed Systems
    1. Definitions
    2. Why Monitor?
    3. Setting Reasonable Expectations for Monitoring
    4. Symptoms Versus Causes
    5. Black-Box Versus White-Box
    6. The Four Golden Signals
    7. Worrying About Your Tail (or, Instrumentation and Performance)
    8. Choosing an Appropriate Resolution for Measurements
    9. As Simple as Possible, No Simpler
    10. Tying These Principles Together
    11. Monitoring for the Long Term
      1. Bigtable SRE: A Tale of Over-Alerting
      2. Gmail: Predictable, Scriptable Responses from Humans
      3. The Long Run
    12. Conclusion
  11. 7. The Evolution of Automation at Google
    1. The Value of Automation
      1. Consistency
      2. A Platform
      3. Faster Repairs
      4. Faster Action
      5. Time Saving
    2. The Value for Google SRE
    3. The Use Cases for Automation
      1. Google SRE’s Use Cases for Automation
      2. A Hierarchy of Automation Classes
    4. Automate Yourself Out of a Job: Automate ALL the Things!
    5. Soothing the Pain: Applying Automation to Cluster Turnups
      1. Detecting Inconsistencies with Prodtest
      2. Resolving Inconsistencies Idempotently
      3. The Inclination to Specialize
      4. Service-Oriented Cluster-Turnup
    6. Borg: Birth of the Warehouse-Scale Computer
    7. Reliability Is the Fundamental Feature
    8. Recommendations
  12. 8. Release Engineering
    1. The Role of a Release Engineer
    2. Philosophy
      1. Self-Service Model
      2. High Velocity
      3. Hermetic Builds
      4. Enforcement of Policies and Procedures
    3. Continuous Build and Deployment
      1. Building
      2. Branching
      3. Testing
      4. Packaging
      5. Rapid
      6. Deployment
    4. Configuration Management
    5. Conclusions
      1. It’s Not Just for Googlers
      2. Start Release Engineering at the Beginning
  13. 9. Simplicity
    1. System Stability Versus Agility
    2. The Virtue of Boring
    3. I Won’t Give Up My Code!
    4. The “Negative Lines of Code” Metric
    5. Minimal APIs
    6. Modularity
    7. Release Simplicity
    8. A Simple Conclusion
  14. III. Practices
  15. 10. Practical Alerting from Time-Series Data
    1. The Rise of Borgmon
    2. Instrumentation of Applications
    3. Collection of Exported Data
    4. Storage in the Time-Series Arena
      1. Labels and Vectors
    5. Rule Evaluation
    6. Alerting
    7. Sharding the Monitoring Topology
    8. Black-Box Monitoring
    9. Maintaining the Configuration
    10. Ten Years On…
  16. 11. Being On-Call
    1. Introduction
    2. Life of an On-Call Engineer
    3. Balanced On-Call
      1. Balance in Quantity
      2. Balance in Quality
      3. Compensation
    4. Feeling Safe
    5. Avoiding Inappropriate Operational Load
      1. Operational Overload
      2. A Treacherous Enemy: Operational Underload
    6. Conclusions
  17. 12. Effective Troubleshooting
    1. Theory
    2. In Practice
      1. Problem Report
      2. Triage
      3. Examine
      4. Diagnose
      5. Test and Treat
    3. Negative Results Are Magic
      1. Cure
    4. Case Study
    5. Making Troubleshooting Easier
    6. Conclusion
  18. 13. Emergency Response
    1. What to Do When Systems Break
    2. Test-Induced Emergency
      1. Details
      2. Response
      3. Findings
    3. Change-Induced Emergency
      1. Details
      2. Response
      3. Findings
    4. Process-Induced Emergency
      1. Details
      2. Response
      3. Findings
    5. All Problems Have Solutions
    6. Learn from the Past. Don’t Repeat It.
      1. Keep a History of Outages
      2. Ask the Big, Even Improbable, Questions: What If…?
      3. Encourage Proactive Testing
    7. Conclusion
  19. 14. Managing Incidents
    1. Unmanaged Incidents
    2. The Anatomy of an Unmanaged Incident
      1. Sharp Focus on the Technical Problem
      2. Poor Communication
      3. Freelancing
    3. Elements of Incident Management Process
      1. Recursive Separation of Responsibilities
      2. A Recognized Command Post
      3. Live Incident State Document
      4. Clear, Live Handoff
    4. A Managed Incident
    5. When to Declare an Incident
    6. In Summary
  20. 15. Postmortem Culture: Learning from Failure
    1. Google’s Postmortem Philosophy
    2. Collaborate and Share Knowledge
    3. Introducing a Postmortem Culture
    4. Conclusion and Ongoing Improvements
  21. 16. Tracking Outages
    1. Escalator
    2. Outalator
      1. Aggregation
      2. Tagging
      3. Analysis
      4. Unexpected Benefits
  22. 17. Testing for Reliability
    1. Types of Software Testing
      1. Traditional Tests
      2. Production Tests
    2. Creating a Test and Build Environment
    3. Testing at Scale
      1. Testing Scalable Tools
      2. Testing Disaster
      3. The Need for Speed
      4. Pushing to Production
      5. Expect Testing Fail
      6. Integration
      7. Production Probes
    4. Conclusion
  23. 18. Software Engineering in SRE
    1. Why Is Software Engineering Within SRE Important?
    2. Auxon Case Study: Project Background and Problem Space
      1. Traditional Capacity Planning
      2. Our Solution: Intent-Based Capacity Planning
    3. Intent-Based Capacity Planning
      1. Precursors to Intent
      2. Introduction to Auxon
      3. Requirements and Implementation: Successes and Lessons Learned
      4. Raising Awareness and Driving Adoption
      5. Team Dynamics
    4. Fostering Software Engineering in SRE
      1. Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
      2. Getting There
    5. Conclusions
  24. 19. Load Balancing at the Frontend
    1. Power Isn’t the Answer
    2. Load Balancing Using DNS
    3. Load Balancing at the Virtual IP Address
  25. 20. Load Balancing in the Datacenter
    1. The Ideal Case
    2. Identifying Bad Tasks: Flow Control and Lame Ducks
      1. A Simple Approach to Unhealthy Tasks: Flow Control
      2. A Robust Approach to Unhealthy Tasks: Lame Duck State
    3. Limiting the Connections Pool with Subsetting
      1. Picking the Right Subset
      2. A Subset Selection Algorithm: Random Subsetting
      3. A Subset Selection Algorithm: Deterministic Subsetting
    4. Load Balancing Policies
      1. Simple Round Robin
      2. Least-Loaded Round Robin
      3. Weighted Round Robin
  26. 21. Handling Overload
    1. The Pitfalls of “Queries per Second”
    2. Per-Customer Limits
    3. Client-Side Throttling
    4. Criticality
    5. Utilization Signals
    6. Handling Overload Errors
      1. Deciding to Retry
    7. Load from Connections
    8. Conclusions
  27. 22. Addressing Cascading Failures
    1. Causes of Cascading Failures and Designing to Avoid Them
      1. Server Overload
      2. Resource Exhaustion
      3. Service Unavailability
    2. Preventing Server Overload
      1. Queue Management
      2. Load Shedding and Graceful Degradation
      3. Retries
      4. Latency and Deadlines
    3. Slow Startup and Cold Caching
      1. Always Go Downward in the Stack
    4. Triggering Conditions for Cascading Failures
      1. Process Death
      2. Process Updates
      3. New Rollouts
      4. Organic Growth
      5. Planned Changes, Drains, or Turndowns
    5. Testing for Cascading Failures
      1. Test Until Failure and Beyond
      2. Test Popular Clients
      3. Test Noncritical Backends
    6. Immediate Steps to Address Cascading Failures
      1. Increase Resources
      2. Stop Health Check Failures/Deaths
      3. Restart Servers
      4. Drop Traffic
      5. Enter Degraded Modes
      6. Eliminate Batch Load
      7. Eliminate Bad Traffic
    7. Closing Remarks
  28. 23. Managing Critical State: Distributed Consensus for Reliability
    1. Motivating the Use of Consensus: Distributed Systems Coordination Failure
      1. Case Study 1: The Split-Brain Problem
      2. Case Study 2: Failover Requires Human Intervention
      3. Case Study 3: Faulty Group-Membership Algorithms
    2. How Distributed Consensus Works
      1. Paxos Overview: An Example Protocol
    3. System Architecture Patterns for Distributed Consensus
      1. Reliable Replicated State Machines
      2. Reliable Replicated Datastores and Configuration Stores
      3. Highly Available Processing Using Leader Election
      4. Distributed Coordination and Locking Services
      5. Reliable Distributed Queuing and Messaging
    4. Distributed Consensus Performance
      1. Multi-Paxos: Detailed Message Flow
      2. Scaling Read-Heavy Workloads
      3. Quorum Leases
      4. Distributed Consensus Performance and Network Latency
      5. Reasoning About Performance: Fast Paxos
      6. Stable Leaders
      7. Batching
      8. Disk Access
    5. Deploying Distributed Consensus-Based Systems
      1. Number of Replicas
      2. Location of Replicas
      3. Capacity and Load Balancing
    6. Monitoring Distributed Consensus Systems
    7. Conclusion
  29. 24. Distributed Periodic Scheduling with Cron
    1. Cron
      1. Introduction
      2. Reliability Perspective
    2. Cron Jobs and Idempotency
    3. Cron at Large Scale
      1. Extended Infrastructure
      2. Extended Requirements
    4. Building Cron at Google
      1. Tracking the State of Cron Jobs
      2. The Use of Paxos
      3. The Roles of the Leader and the Follower
      4. Storing the State
      5. Running Large Cron
    5. Summary
  30. 25. Data Processing Pipelines
    1. Origin of the Pipeline Design Pattern
    2. Initial Effect of Big Data on the Simple Pipeline Pattern
    3. Challenges with the Periodic Pipeline Pattern
    4. Trouble Caused By Uneven Work Distribution
    5. Drawbacks of Periodic Pipelines in Distributed Environments
      1. Monitoring Problems in Periodic Pipelines
      2. “Thundering Herd” Problems
      3. Moiré Load Pattern
    6. Introduction to Google Workflow
      1. Workflow as Model-View-Controller Pattern
    7. Stages of Execution in Workflow
      1. Workflow Correctness Guarantees
    8. Ensuring Business Continuity
    9. Summary and Concluding Remarks
  31. 26. Data Integrity: What You Read Is What You Wrote
    1. Data Integrity’s Strict Requirements
      1. Choosing a Strategy for Superior Data Integrity
      2. Backups Versus Archives
      3. Requirements of the Cloud Environment in Perspective
    2. Google SRE Objectives in Maintaining Data Integrity and Availability
      1. Data Integrity Is the Means; Data Availability Is the Goal
      2. Delivering a Recovery System, Rather Than a Backup System
      3. Types of Failures That Lead to Data Loss
      4. Challenges of Maintaining Data Integrity Deep and Wide
    3. How Google SRE Faces the Challenges of Data Integrity
      1. The 24 Combinations of Data Integrity Failure Modes
      2. First Layer: Soft Deletion
      3. Second Layer: Backups and Their Related Recovery Methods
      4. Overarching Layer: Replication
      5. 1T Versus 1E: Not “Just” a Bigger Backup
      6. Third Layer: Early Detection
      7. Knowing That Data Recovery Will Work
    4. Case Studies
      1. Gmail—February, 2011: Restore from GTape
      2. Google Music—March 2012: Runaway Deletion Detection
    5. General Principles of SRE as Applied to Data Integrity
      1. Beginner’s Mind
      2. Trust but Verify
      3. Hope Is Not a Strategy
      4. Defense in Depth
    6. Conclusion
  32. 27. Reliable Product Launches at Scale
    1. Launch Coordination Engineering
      1. The Role of the Launch Coordination Engineer
    2. Setting Up a Launch Process
      1. The Launch Checklist
      2. Driving Convergence and Simplification
      3. Launching the Unexpected
    3. Developing a Launch Checklist
      1. Architecture and Dependencies
      2. Integration
      3. Capacity Planning
      4. Failure Modes
      5. Client Behavior
      6. Processes and Automation
      7. Development Process
      8. External Dependencies
      9. Rollout Planning
    4. Selected Techniques for Reliable Launches
      1. Gradual and Staged Rollouts
      2. Feature Flag Frameworks
      3. Dealing with Abusive Client Behavior
      4. Overload Behavior and Load Tests
    5. Development of LCE
      1. Evolution of the LCE Checklist
      2. Problems LCE Didn’t Solve
    6. Conclusion
  33. IV. Management
  34. 28. Accelerating SREs to On-Call and Beyond
    1. You’ve Hired Your Next SRE(s), Now What?
    2. Initial Learning Experiences: The Case for Structure Over Chaos
      1. Learning Paths That Are Cumulative and Orderly
      2. Targeted Project Work, Not Menial Work
    3. Creating Stellar Reverse Engineers and Improvisational Thinkers
      1. Reverse Engineers: Figuring Out How Things Work
      2. Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure
      3. Improv Artists: When the Unexpected Happens
      4. Tying This Together: Reverse Engineering a Production Service
    4. Five Practices for Aspiring On-Callers
      1. A Hunger for Failure: Reading and Sharing Postmortems
      2. Disaster Role Playing
      3. Break Real Things, Fix Real Things
      4. Documentation as Apprenticeship
      5. Shadow On-Call Early and Often
    5. On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
    6. Closing Thoughts
  35. 29. Dealing with Interrupts
    1. Managing Operational Load
    2. Factors in Determining How Interrupts Are Handled
    3. Imperfect Machines
      1. Cognitive Flow State
      2. Do One Thing Well
      3. Seriously, Tell Me What to Do
      4. Reducing Interrupts
  36. 30. Embedding an SRE to Recover from Operational Overload
    1. Phase 1: Learn the Service and Get Context
      1. Identify the Largest Sources of Stress
      2. Identify Kindling
    2. Phase 2: Sharing Context
      1. Write a Good Postmortem for the Team
      2. Sort Fires According to Type
    3. Phase 3: Driving Change
      1. Start with the Basics
      2. Get Help Clearing Kindling
      3. Explain Your Reasoning
      4. Ask Leading Questions
    4. Conclusion
  37. 31. Communication and Collaboration in SRE
    1. Communications: Production Meetings
      1. Agenda
      2. Attendance
    2. Collaboration within SRE
      1. Team Composition
      2. Techniques for Working Effectively
    3. Case Study of Collaboration in SRE: Viceroy
      1. The Coming of the Viceroy
      2. Challenges
      3. Recommendations
    4. Collaboration Outside SRE
    5. Case Study: Migrating DFP to F1
    6. Conclusion
  38. 32. The Evolving SRE Engagement Model
    1. SRE Engagement: What, How, and Why
    2. The PRR Model
    3. The SRE Engagement Model
      1. Alternative Support
    4. Production Readiness Reviews: Simple PRR Model
      1. Engagement
      2. Analysis
      3. Improvements and Refactoring
      4. Training
      5. Onboarding
      6. Continuous Improvement
    5. Evolving the Simple PRR Model: Early Engagement
      1. Candidates for Early Engagement
      2. Benefits of the Early Engagement Model
    6. Evolving Services Development: Frameworks and SRE Platform
      1. Lessons Learned
      2. External Factors Affecting SRE
      3. Toward a Structural Solution: Frameworks
      4. New Service and Management Benefits
    7. Conclusion
  39. V. Conclusions
  40. 33. Lessons Learned from Other Industries
    1. Meet Our Industry Veterans
    2. Preparedness and Disaster Testing
      1. Relentless Organizational Focus on Safety
      2. Attention to Detail
      3. Swing Capacity
      4. Simulations and Live Drills
      5. Training and Certification
      6. Focus on Detailed Requirements Gathering and Design
      7. Defense in Depth and Breadth
    3. Postmortem Culture
    4. Automating Away Repetitive Work and Operational Overhead
    5. Structured and Rational Decision Making
    6. Conclusions
  41. 34. Conclusion
  42. A. Availability Table
  43. B. A Collection of Best Practices for Production Services
    1. Fail Sanely
    2. Progressive Rollouts
    3. Define SLOs Like a User
    4. Error Budgets
    5. Monitoring
    6. Postmortems
    7. Capacity Planning
    8. Overloads and Failure
    9. SRE Teams
  44. C. Example Incident State Document
  45. D. Example Postmortem
    1. Lessons Learned
    2. Timeline
    3. Supporting information:
  46. E. Launch Coordination Checklist
  47. F. Example Production Meeting Minutes
  48. Bibliography
  49. Index

Product information

  • Title: Site Reliability Engineering
  • Author(s): Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
  • Release date: April 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491929124