Book description
Site Reliability Engineering (SRE), a framework for managing enterprise software systems first developed at Google, helps lower operational costs, enhance development productivity, and speed up feature releases. But if service-level objectives (SLOs) aren’t part of your SRE strategy, you’re leaving value on the table. This practical report details why and how to make SLOs, service-level indicators (SLIs), and error budgets critical components of your SRE practice.
Drawing on results from Google’s recent SLO Adoption and Usage Survey, along with real-world case studies, this guide walks you through defining and determining an acceptable level of reliability and using it to set expectations for stability and better manage system changes. Whether you’re an SRE, executive, developer, or architect, you’ll learn how to improve your SRE practices by taking an approach based on SLOs and error budgets to measuring and managing your service.
- Understand common service-level terminology, including objectives, indicators, agreements, and error budgets
- Build SLOs and SLIs step by step
- Use error budgets to align teams and make joint decisions about reliability and development velocity
- See how Schlumberger and Evernote implemented SLOs and used the insights gained to manage their businesses
Table of contents
- Executive Summary
- Preface
- 1. SLOs: The Magic Behind SRE
- 2. Summary of the Data
- Who Took Our Survey
- Most Firms Have Had SRE Teams for Fewer Than Three Years
- Who Uses SLOs
- How Organizations Use SLOs
- Most Firms Embrace SRE Practices but Fail to Engage in SLOs
- Critical Infrastructure Is the Most Common Service Measured by SLOs
- Majority of Respondents Measure “Some” of Their Services with SLOs
- SLOs Above 99% Are Most Common Among Respondents
- Internal Action Is the Most Common Response to Missing SLOs
- SLO Reviews Are Underutilized by the Majority of Respondents
- Availability Is the Top SLI Measurement
- Summary
- 3. Selecting SLOs
- 4. Constructing SLIs to Inform SLOs
- Defining SLIs
- SLIs Are Metrics to Deliver User Happiness
- Common SLI Types
- SLI Structure
- Developing SLIs
- Tracking Reliability with SLIs
- Ways to Measure SLIs
- Use SLIs to Define SLOs
- Determine a Time Window for Measuring SLOs
- SLO Examples for Availability and Latency
- Iterating and Improving SLOs
- Summary
- 5. Using Error Budgets to Manage a Service
- 6. SLO Implementation Case Studies
- 7. Conclusion
Product information
- Title: SLO Adoption and Usage in Site Reliability Engineering
- Author(s):
- Release date: April 2020
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492075363