Chapter 4. Choosing Good Service Level Objectives
Every system will fail at some point. Sometimes systems will fail in catastrophic ways, and other times they can fail in ways that are barely noticeable. System failures can be lengthy or last just fractions of a second. Some system failures require human intervention in order to get things back into a good state, while in other cases the system may start operating correctly again all by itself.
Chapter 3 discussed how to think about measuring if a service is doing what it is supposed to be doing, and from that framing we can define failure as when a service is not doing what it is supposed to be doing. A service failure does not have to mean there is an emergency. Things as simple as an API not sending a response quickly enough or the click of a button on a web form not registering are both failures. Failure happens all the time, because complex systems are fallible. This is all totally fine and expected and shouldn't stress you out.
Problems only arise when failures occur too often or last too long, and that's what service level objectives are all about. If a service level indicator gives you a good way to think about whether your service is performing in the manner it should be, a service level objective gives you a good way to think about whether your service is doing so often enough.
We've established that you can't be perfect, but how good should you try to be instead? This chapter looks to help you figure that out. First, we'll talk about what SLO targets really are and why it's important to choose them to the best of your ability. Second, we'll spend a little bit of time talking about service components and dependencies and how to take these into consideration when setting SLOs. After that, we'll get into some of the ways you can use data to help you pick these targets, including an introduction to the basic statistics you'll need in order to do so.
Reliability Targets
Fundamentally, SLOs are targets: they're a defined set of criteria that represent an objective that you're trying to reach. Good SLOs generally have two traits in common:
If you are exceeding your SLO target, your users are happy with the state of your service.
If you are missing your SLO target, your users are unhappy with the state of your service.
But what exactly do we mean by user happiness?
User Happiness
When we talk about user happiness in terms of service reliability, we're mostly appealing to the contentment aspect of happiness. It's not necessarily the case that the users of your service have to be actively and consciously overjoyed with their experience in order for them to be happy. For some people, it might be easier to think about it in terms of your users not being unhappy.
At some level, these ideas come from the concept that you need satisfied users in order to have a growing business. Reliability is a service feature that will often determine whether people choose to use your service as opposed to another one. Chances are that one of the goals of your service is to attract more users, even if you aren't strictly a business. Being reliable, and thinking about the happiness of your users, is a major component of this.
This is also applicable to services that do not strictly serve customers. For example, if you're in charge of the database offering for your organization, and your offering is seen as too unreliable by other engineers, they're going to find ways to work around this. They might spin up their own database instances when they really shouldn't, or they might try to solve data storage problems in a suboptimal manner that doesn't involve a database at all.
We could also imagine an internal service that users can't find a workaround for. Perhaps you maintain the Kubernetes layer at your organization. If users of this service (your fellow engineers) are too unhappy about its reliability, they'll eventually get fed up and find some way to move to a different service, even if that means actually leaving the company.
You want to make sure that you're reliable, and you want to make sure that your users are happy. Whatever targets you choose, they have to be ones that keep this in mind.
The Problem of Being Too Reliable
That all being said, you also don't want to be too reliable. There are a few reasons for this.
Imagine, for example, that you've chosen an SLO target percentage of 99.9%. You've done a lot of due diligence and followed the advice in this book in order to determine that this is the right objective for you. As long as you're exceeding this 99.9%, users aren't complaining, they aren't moving elsewhere, and your business is growing and doing well.
Additionally, if you miss this target by just a little bit, you likely won't immediately hemorrhage users. This is ideal, since it gives you time to say, "We've missed our target, so now we need to refocus our efforts to ensure we stay above it more often." You can use the data that your SLO provides you in order to make decisions about the service and the work you're performing.
However, let's now imagine that you're routinely being 99.99% reliable instead of just hitting your 99.9% target. Even if your SLO is published and discoverable, people are going to end up expecting that things will continue to be 99.99% reliable, because humans generally expect the future to look like the past.1 Even if it was true that in the past everyone was actually happy with 99.9%, their expectations have now grown. Sometimes this is absolutely fine. Services and products can mature over time, and providing your users with a good experience is never a bad idea.
So maybe you make your official target more stringent, and now you aim for 99.99%. By doing so you're giving yourself fewer opportunities to fail but also fewer opportunities to learn. If you're being too reliable all the time, you're also missing out on one of the fundamental features that SLO-based approaches give you: the freedom to do what you want. Being too reliable means missing out on opportunities to experiment, perform chaos engineering, ship features quicker than you have before, or even just induce structured downtime to see how your dependencies react; in other words, you lose a lot of ways to learn about your systems.
Additionally, you need to think about the concept of operational underload. People learn how to fix things by doing so. Especially in complex systems, you can learn so much from failures. There is almost no better way to learn about how systems work than to respond to them when they aren't performing how they're supposed to. If things never fail, you'll be missing out on all of that.
Tip
Chapter 5 goes into much more detail about how to use error budgets, but ensuring you don't lose insight into how your services work by inducing failure or allowing it to occur is one of the main components at play. If your users and your business only need you to be 99.9% reliable, it is often a good idea to make sure you're not far beyond that. You'll still want to make sure that you're able to handle unforeseen issues, but you can set appropriate expectations as well as provide useful learning opportunities if you make sure you're not too reliable all the time. Pick SLO target percentages that allow for all of this to be true when you can.
The Problem with the Number Nine
In addition to the temptation to be too reliable, there is another problem you can run into when picking the correct SLO for your service. When people talk about SLOs and SLAs, they most often think about things in terms of "nines."
Even if you don't want to aim for 100% reliability, you do almost always want to be fairly reliable, so it's not surprising that many common reliability targets are very close to 100%. The most common numbers you might run into are things like 99%, 99.9%, 99.99%, or even the generally unattainable 99.999%.2 These targets are so common that people often refer to them as just "two nines," "three nines," "four nines," and "five nines."
Table 4-1 shows what these targets actually look like in terms of acceptable bad time.3
Target | Per day | Per month | Per year |
---|---|---|---|
99.999% | 0.9 s | 26.3 s | 5 m 15.6 s |
99.99% | 8.6 s | 4 m 23 s | 52 m 35.7 s |
99.9% | 1 m 26.4 s | 43 m 49.7 s | 8 h 45 m 57 s |
99% | 14 m 24 s | 7 h 18 m 17.5 s | 3 d 15 h 39 m |
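If you want to reproduce numbers like these for a target of your own, the arithmetic is simple enough to script. Here is a minimal Python sketch, assuming a 365.25-day year to account for leap years (as the footnoted calculations do); exact figures will shift by a second or two depending on the year length you choose.

```python
# Allowed "bad time" for a reliability target over a given period.
def allowed_bad_seconds(target: float, period_seconds: float) -> float:
    return (1 - target) * period_seconds

DAY = 24 * 60 * 60
MONTH = DAY * 365.25 / 12   # average month, assuming a 365.25-day year
YEAR = DAY * 365.25

for target in (0.99999, 0.9999, 0.999, 0.99):
    print(f"{target:.3%}: "
          f"{allowed_bad_seconds(target, DAY):.1f} s/day, "
          f"{allowed_bad_seconds(target, MONTH):.1f} s/month, "
          f"{allowed_bad_seconds(target, YEAR):.1f} s/year")
```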
Not only can hitting these long strings of nines be much more difficult and expensive than people realize, but there is also a general problem where people only think about SLO targets as comprising series of the number nine, when in reality this doesn't make any sense at all. Picking the right target for your service involves thinking about your users, your engineers, and your resources; it shouldn't be arbitrarily constrained in this way.
You might also see targets such as 99.95% or 99.98%, and including these is certainly an improvement over using only the number nine, but even here you're not always allowing yourself enough nuance to describe the requirements of your exact service.
There is absolutely nothing wrong with having an SLO defined as having a target of something like 99.97%, 98.62%, or even 87%. You can address having low target percentages by using percentiles in your definition (we'll talk more about that later in this chapter), but you should also make sure you aren't tied to thinking about these targets just in terms of the number nine.
Table 4-2 shows some other options and what amounts of bad time those translate into.
Target | Per day | Per month | Per year |
---|---|---|---|
99.95% | 43.2 s | 21 m 54.9 s | 4 h 22 m 58.5 s |
99.7% | 4 m 19.2 s | 2 h 11 m 29.2 s | 1 d 2 h 17 m 50.9 s |
99.3% | 10 m 4.8 s | 5 h 6 m 48.2 s | 2 d 13 h 21 m 38.7 s |
98% | 28 m 48 s | 14 h 36 m 34.9 s | 7 d 7 h 18 m 59 s |
That's not to say you should be aiming at a lower target if you don't have a reason to do so, but the difference between 99.9% and 99.99% (or something similar) is often much greater than people realize at first. You should be looking at the numbers in between as well.
Sometimes it's helpful to start with a time rather than a percentage. For example, it might be reasonable (or even required due to the daily downtime of your dependencies, locking backups taking place, and so on) to want to account for about two hours of unreliability per month. Two hours out of the roughly 730 hours in an average month is about 0.27%, so 99.7% would be the correct starting point, and you could move on from there after seeing how you perform at that target for some time. Some of the most useful SLOs I have personally worked with have been set at carefully measured numbers like 97.2%, and there is nothing wrong with that. Later in this chapter we'll discuss in more depth how to do this math and make these measurements.
The Problem with Too Many SLOs
As you start on your journey toward an SLO-based approach to reliability, it might be tempting to set a lot of SLOs for your services. There is no correct number of SLOs to establish, and the number that will be correct for you will heavily depend on both how complex your services are and how mature the SLO culture in your organization is.
While you do want to capture the most important features of your system, you can often accomplish this by measuring only a subset of these features. Always ask yourself what your users need, and start by observing the most important and common of these needs. SLOs are a process, and you can always add (or remove!) them at any point that makes sense.
When the number of SLOs you have grows to be too large, you'll run into a few particular problems. First, it will be more difficult to make decisions using your data. If you view SLOs as providing you with data you can use to make meaningful choices about how to improve your service, having too many divergent data points can result in these decisions being harder to make. Imagine, for example, a storage service with a simple caching layer. It might not be necessary to have separate SLOs for both cache miss latency and cache hit latency for reads. You'll certainly still want to be collecting metrics on both, but you might just be adding noise if you have entirely independent SLOs for each. In this situation you could just have an SLO for general read latency, and if you start performing badly against your target, you can use your separate metrics to determine where the problem lies (hits, misses, or both) and what you need to address to make things better.
The second problem you can run into is that it becomes more complicated to report to others what the reliability status of your service has been. If you can provide someone outside of your team with the status of three to five SLOs over time, they can probably infer from that data both how your service has been running and how they could set their own targets if they depend on it. If they have to sort through dozens of SLOs, perhaps all with different target percentages and histories, you're not benefiting from one of the elements that this whole process is about: communicating your reliability to others in an easy-to-understand way.
More generally, there are statistical issues that arise with too many measurements. The multiple comparison problem, at its most basic, arises from the fact that the more measurements you take of the same system, the greater the chance that some of those measurements will be incorrect. And even if the measurements are actually correct, if you're looking at too many things you'll always find something that looks just slightly off, which can waste your time by sending you down endless rabbit holes.
Note
Every system is unique, and there is no perfect answer to the question of how many SLOs you should define for any particular service. As with everything, try to be reasonable. SLOs are about providing you data to have discussions about, and you can't do that if you have too many data points to discuss.
Service Dependencies and Components
No service stands alone; everything depends on something else. Microservices often have downstream dependencies such as other microservices and databases. Services that appear to be mostly standalone will always have upstream dependencies, such as load balancers, routers, and the network in general. In both of these situations, these services will be dependent upon a compute layer of some sort, be that a container orchestration layer, a virtual machine infrastructure, or an operating system running on a bare-metal physical machine.
And we can go much deeper than that. An operating system running on a physical machine is dependent on that physical machine, which is dependent on things like power circuits, which are dependent on delivery from an electrical substation, and so forth. We could continue down this path virtually infinitely.
Because everything has many dependencies, it also turns out that services often have many components. Complex computer systems are made up of deep interwoven layers of service dependencies, and before you can set appropriate SLO targets, you need to understand how the various components of your services interact with each other.
Service Dependencies
When thinking about what kind of objective you can set for your service, you have to think about the dependencies your service has. There are two primary types of service dependencies. First are the hard dependencies. A hard dependency is one that has to be reliable for your service to be reliable. For example, if your service needs to read from a database in order to do what it is supposed to do, it cannot be reliable if that database isn't. Second are soft dependencies. A soft dependency is something that your service needs in order to operate optimally but that it can still be reliable without. Converting your hard dependencies into soft ones is one of the best steps you can take to make your service more reliable.
To choose a good service level objective, you have to start by examining how reliable your dependencies are. There's some simple math you can do to calculate the effect they have on the reliability your service can offer; I'll show you that after we dig a little more deeply into the issues of hard and soft dependencies.
Hard dependencies
Understanding the effect your known hard dependencies have on your service is not an overly complicated ordeal.4 If the reliability of your service directly depends on the reliability of another service, your service cannot be any more reliable than that one is. There are two primary ways you can determine the reliability of a hard dependency.
The first is just to measure it. To continue with our database example, you can measure how many requests to this database complete without a failure (whether that means without an error or timeout, or quickly enough) directly from your own service. You don't have to have any kind of administrative access to the database to understand how it works from your perspective. In this situation, you are the user, and you get to determine what reliable means. Measure things for a while, and use the result to determine what kind of reliability you might be able to expect moving into the future.
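As a rough illustration (the names and the one-second threshold here are hypothetical, not from this book), measuring a hard dependency from your own side can be as simple as counting good and total calls to it:

```python
import time

# Counters for the dependency as seen from our service.
good_calls = 0
total_calls = 0

def call_database(run_query, max_latency_seconds=1.0):
    """Count a call as good only if it returns without an error and quickly
    enough; we, as the user of the dependency, define what 'reliable' means."""
    global good_calls, total_calls
    total_calls += 1
    start = time.monotonic()
    try:
        result = run_query()
    except Exception:
        return None  # errors count as bad events
    if time.monotonic() - start <= max_latency_seconds:
        good_calls += 1
    return result

# After measuring for a while: observed reliability = good_calls / total_calls
```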
The second, and more meaningful, way is to look at the published SLOs and reliability history of your dependencies, if they have them and they're shared with users. If the team responsible for the database you depend upon has internalized the lessons of an SLO-based approach, you can trust them to publish reasonable SLO targets. You can trust that team to take action if their service starts to exceed its error budget, so you can safely set your target a little lower than theirs.
Soft dependencies
Soft dependencies are a little more difficult to define than hard dependencies, and they also vary much more wildly in how they impact the reliability of your service. Hard dependencies are pretty simple to define and locate, and if a hard dependency isn't being reliable (whether it's entirely unavailable or just responding slowly), your service isn't being reliable during that time frame, either.
Soft dependencies, however, don't have this same one-to-one mapping. When they're unreliable, the reliability of your service may be merely impacted, not nullified. A good example is services that provide additional data to make the user experience more robust, but aren't strictly required for it to function.
For example, imagine a maps application on your phone. The primary purpose of such an application could be to display maps of your immediate surroundings, show what businesses or addresses are located where, and help you orient yourself. The application might also allow you to overlay additional data such as traffic congestion, user reviews of restaurants, or a satellite view. If the services that provide this traffic, user review, or satellite maps data aren't operating reliably, it certainly impacts the reliability of the maps application, but it doesn't make it wholly unreliable, since the application can still perform its primary functions.
Turning hard dependencies into soft dependencies
One of the best things you can do in terms of making your service more reliable is to remove the hard dependencies it might have. Removing hard dependencies is not often a viable option, however, so in those situations you should think about how you might at least be able to turn them into soft dependencies instead.
For instance, going back to our database example, you might be able to introduce a caching layer. If much of the data is similar, or it doesn't necessarily have to be up-to-date to the second, using a cache could allow you to continue to operate reliably from the perspective of your users even if there are failures happening on the backend.
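A minimal sketch of that idea follows; the names and the exception type are hypothetical, but the shape is common: serve possibly stale cached data when the backend fails, so the hard dependency becomes a soft one.

```python
class DatabaseUnavailableError(Exception):
    """Hypothetical error raised when the backend database cannot be reached."""

def read_record(key, cache, database):
    try:
        value = database.get(key)   # primary path: fresh data from the backend
        cache.set(key, value)       # keep the cache warm for future failures
        return value
    except DatabaseUnavailableError:
        # Degraded but still useful: possibly stale data instead of an error.
        return cache.get(key)
```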
The topic of turning hard dependencies into soft ones is way too large for this book, but remember to think about this as you determine your SLO targets and use this as an example of the kind of work you could perform to increase the reliability of your service.
Dependency math
Perhaps the most important part of thinking about your service dependencies is understanding how to perform the math you need in order to take their reliability into account. You cannot promise a better reliability target than the things you are dependent on.
Most services aren't just individual pieces that float around in an empty sea. In a world of microservices where each might have a single team assigned to it, these services work together as a collective to comprise an entirely different service, which may not have a dedicated team assigned to it. Services are generally made up of many components, and when each of those components has its reliability measured (or its own SLO defined), you can use that data to figure out mathematically what the reliability of a multicomponent service might look like.
An important takeaway for now is how quickly a reasonable reliability target can erode in situations such as this. For example, let's say your service is a customer-facing API or website of some sort. A reasonably modern version of a service such as this could have dozens and dozens of internal components, from container-based microservices and larger monoliths running on virtual machines, to databases and caching layers.
Imagine you have 40 total components, each of which promises a 99.9% reliability target and has equal weight in terms of how it can impact the reliability of the collective service. In such situations, the service as a whole can only promise much less than 99.9% reliability. Performing this math is pretty simple; you just multiply 99.9% by itself 40 times:
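0.999 × 0.999 × ... × 0.999 (40 times) = 0.999^40 ≈ 0.9608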
So, a service made up of 40 components, each running at 99.9% reliability, can only ever promise to be about 96% reliable. This math is, of course, overly simplistic compared to what you might actually see in terms of service composition in the real world, and Chapter 9 covers more complicated and practical ways to perform these kinds of calculations. The point for now is to remember that you need to be reasonable when deciding how stringent you are with your SLOs; you often cannot actually promise the reliability that you think or wish you could. Remember to stay realistic.
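Here is a minimal sketch of that simplified model in Python, treating each component as an equally weighted hard dependency:

```python
def composite_reliability(component_targets):
    """Best-case reliability of a service whose hard dependencies all carry
    equal weight: the product of their individual targets."""
    result = 1.0
    for target in component_targets:
        result *= target
    return result

print(composite_reliability([0.999] * 40))  # ~0.9608, or roughly 96%
```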
Service Components
As we've seen (for example, in the case of the retail website discussed in the previous chapter), a service can be composed of multiple components, some of which may themselves be services of different types. Such services generally fall into two categories: those whose components are owned by multiple teams, and those whose components are all owned by the same team. Naturally, this has implications when it comes to establishing SLIs and SLOs.
Multiple-team component services
When a service consists of many components that are owned by multiple teams, there are two primary things to keep in mind when choosing SLIs and SLOs.
The first is that even if SLOs are set for the entire service, or a subset of multiple components, each team should probably have SLIs and SLOs for its own components as well. The primary goal of SLO-based approaches to reliability is to provide you with the data you need to make decisions about your service. You can use this data to ask questions like: Is the service reliable enough? Should we be spending more time on reliability work as opposed to shipping new features? Are our users happy with the current state of the world? Each team responsible for a service needs to have the data to consider these questions and make the right decisions; therefore, all the components of a service owned by multiple teams should have SLOs defined.
The second consideration about services owned by many teams is determining who owns the SLOs that are set for the overarching service, or even just subsets of that service. Chapter 15 addresses the issue of ownership in detail.
Single-team component services
For services that consist of multiple components that are all owned by a single team, things can get a bit more variable. On the one hand, you could just apply the lessons of the multiple-team component services and set SLOs for every part of your stack. This is not necessarily a bad idea, but depending on the size or complexity of the service, you could also end up in a situation where a single team is responsible for and inundated by SLO definitions and statuses for what realistically is just a single service to the outside world.
When a single team owns all the components of what users think of as a service, it can often be sufficient to just define meaningful SLIs that describe enough of the user journeys and set SLOs for those measurements. For example, if you're responsible for a logging pipeline that includes a message queue, an indexing system, and data storage nodes, you probably don't need SLOs for each of those components. An SLI that measures the latency between when a message is inserted and when it is indexed and available for querying is likely enough to capture most of what your users need from you. Add in another SLI that ensures data integrity, and you've probably got most of your users' desires covered. Use those kinds of SLIs to set the minimum number of SLO targets you actually need, but remember to also use telemetry that tells you how each component is operating to help you figure out where to apply your reliability efforts when your data tells you to do so.
Reliability for Things You Donât Own
The classic example of how SLOs work involves a dichotomy between a development team and an operational team, both responsible for the same service in different ways. In this prototypical example, the development team wants to move fast and ship features, and the operations team wants to move slowly to ensure stability and reliability. SLOs are a way to ease this tension, as they give you data specifically aimed at determining when to move fast and when to move slow. This is the basic foundation of Site Reliability Engineering.
If your team or service doesn't fit into this model, however, that doesn't mean you can't adopt an SLO-based approach. If the service your team supports is open source, proprietary from a vendor, or hardware, you can't really use the primary example of "stop shipping features and focus on reliability code improvements instead," but that doesn't mean you can't shift your focus to reliability. You just have to do it in a slightly different manner.
Open Source or Hosted Services
If you're relying on open source software for your infrastructure, as many companies do, you can still make changes to improve reliability; it's just that the changes you make are not always directly applicable to the code at the base of things. Instead, they're likely things like configuration changes, architecture changes, or changes to in-house code that complements the service in some way. This isn't to say that these sorts of changes don't also apply to services for which you own the entire codebase, just that the classic examples of how SLO-based approaches work often overlook them.
Additionally, you might be reliant on software that is entirely hosted and managed. This can make reliability approaches even more difficult, because in these situations there may not be many configuration or architecture changes you can make. Instead, when thinking about SLOs for these sorts of services, you might start with a baseline that represents the amount of failure a user can tolerate and use this data to justify either renewing a contract or finding a new vendor that can meet your needs.
Measuring Hardware
Chances are there are many different hardware components you might like to observe and measure, but it's not often worth your time unless you're operating at a certain scale. Commercial and enterprise-grade computer hardware is generally already heavily vetted and measured by the manufacturers, and you often cannot develop a system of measurement with enough meaningful data points unless you are either a hardware development company, a telco/internet service provider, or one of the largest web service providers. Remember that unless you introduce complicated math to normalize your data, you generally need quite a few data points in order to ensure that your SLIs aren't triggered only by outliers.
That all being said, you don't have to operate at the scale of a telco or one of the largest tech companies to meaningfully measure the failure rates or performance of your hardware. For example, imagine you're responsible for 2,000 servers in various data centers across the planet. Though the number 2,000 isn't necessarily very large when it comes to statistical analysis, the numbers derived from it could be. You might have 8 hard drives or DIMMs per server, which gives you 16,000 data points to work with. That might be enough for you to develop meaningful metrics about how long your hardware operates without a fault.
Another option is to get aggregated data from other sources, and then apply those same metrics to your own hardware. It can be difficult to get failure rate data from vendors directly, but many resellers collect this data and make it available to their paying customers. You can use this sort of information to help you anticipate the potential failure rates of your own hardware, allowing you to set SLOs that can inform you when you should be ordering replacements or when you should be retiring old systems.
In addition to reseller vendors, there are other aggregated sources of data about hardware failure. For example, Backblaze, a major player in the cloud storage space, releases reports every year about the failure rates of the various hard drive makes and models it employs.
The point is that if you don't have a large enough footprint to use your own measurements to develop statistically sound numbers, you can rely on those who have done this aggregation for you. In Chapter 9 we'll also discuss statistical models you can use to meaningfully predict things you only have sporadic data for.
But I am big enough!
Of course, you might work for a company that operates at such a scale that you can measure your own hardware failure rates easily. Perhaps you're even a hardware manufacturer looking to learn how you can translate failure data into more meaningful data for your customers!
If you have a lot of data, developing SLO targets for your hardware performance doesn't really deviate from anything discussed elsewhere in this book. You need to figure out how often you fail, determine if that level is okay with the users of your hardware, and use that data to set an SLO target that allows you to figure out whether or not you're going to lose users/customers due to your unreliability.
If your SLO targets tell you that you're going to lose users, you need to immediately pivot your business to figuring out how to make your components more reliable, or you are necessarily going to make less money.
The point is that even as the provider of the bottom layer of everything computers rely upon, you're likely aware that you can't be perfect. You cannot deliver hardware components to all of your customers that will function properly all of the time. Some of these components will eventually fail. Some will even be shipped in a bad state. Know this and use this knowledge to make sure you're only aiming to prevent the right number of failures. You'll never prevent 100% of them, so pick a target that doesn't upset your users and that you won't have to spend infinite resources attempting to attain.
Beyond just hardware
In an absolutely perfect world, all SLOs would be built from the ground up. Since anything that is dependent on another system cannot strictly be more reliable than the one it depends on, it would be ideal if each dependency in the chain had a documented reliability target.
For example, power delivery is required for all computer systems to operate. So, perhaps the first step in your own reliability story is knowing how consistently reliable levels of electricity are being delivered to the racks that your servers reside in. Then you have to consider whether those racks have redundant circuits providing power. Then you have to consider the failure rates of the power supply units that deliver power to the other components of your servers. This goes on and on.
Tip
Don't be afraid of applying the concepts outlined in this book to things that aren't strictly software-based services. In fact, remember from the Preface that this same approach can likely be applied to just about any business. Chapter 5 covers some of the ways in which you can use SLOs and error budgets to address human factors.
Choosing Targets
Now that we've established that you shouldn't try to make your target too high, and that your target doesn't have to be composed of just the number nine many times in a row, we need to talk about how you can pick the correct target.
The first thing that needs to be repeated here is that SLOs aren't SLAs; they aren't agreements. When you're working through this process, you should absolutely keep in mind that your SLO should encompass things like ensuring your users are happy and that you can make actionable decisions based upon your measured performance against this SLO; however, you also need to remember that you can change your SLO if the situation warrants it. There is no shame in picking a target and then changing it in short order if it turns out that you were wrong. All systems fail, and that includes humans trying to pick magic numbers.5
Past Performance
The best way to figure out how your service might operate in the future is to study how it has operated in the past. Things about the world and about your service will absolutely change; no one is trying to deny that. But if you need a starting point in order to think about the reliability of your service, the best starting point you'll likely have is looking at its history. No one can predict the future, and the best alternative we have is extrapolating from the past.
Note
You may or may not want to discount previous catastrophes here. Severe incidents are often outlier events that you can learn important lessons from, but they are not always meaningful indicators of future performance. As always, use your own best judgment, and don't forget to account for the changes in the robustness of your service or the resilience of your organization that may have come from these lessons learned.
All SLOs are informed by SLIs, and when developing your SLIs, it will often be the case that you'll have to come up with new metrics. Sometimes you might need to collect or export data in entirely new ways, but other times you might determine that your SLI is a metric you've already been collecting for some amount of time.
No matter which of these is true, you can use this SLI to help you pick your SLO. If it's a new metric, you might have to collect it for a while first; a full calendar month is often a good length of time for this. Once you've done that, or if you already have a sufficient amount of data available, you can use that data about your past performance to set your first SLO. Some basic statistics will help you do the math.
Tip
Even if you have a solid grasp of basic statistics, you might still find value in reading about how to use these techniques within an SLO-specific context. Chapter 9 covers more advanced statistical techniques.
Basic Statistics
Statistical approaches can help you think about your data and your systems in incredibly useful ways, especially when you already have data you can analyze. We'll go into much more depth on the math for picking SLO targets in various chapters in Part II, but this section presents some basic and approachable techniques you can use to analyze the data you have available to you for this purpose. For some services, you might not even need the more advanced techniques described in future chapters, and you might be able to rely mostly on the ones outlined here.
That being said, while we'll tie basic statistical concepts to how they relate to SLOs in the next few pages, those who feel comfortable with the building blocks of statistical analysis can skip ahead to "Metric Attributes".
The five Ms
Statistics is a centuries-old discipline with many different uses, and you can leverage the models and formulae developed by statisticians in the past to help you figure out what your data is telling you. While some advanced techniques will require a decent amount of effort to apply correctly, you can get pretty good insight into how an SLI is performing, and therefore what your SLO should look like, with basic math.
The building blocks of statistical analysis are five concepts that all begin with the letter M: min, max, mean, median, and mode. These are not complicated concepts, but they can give you excellent insight into time series-based data. In Table 4-3 you can see an example of a small time series dataset (known as a sample, indicating that it doesn't represent all data available but only some portion of it).
Time | 16:00 | 16:01 | 16:02 | 16:03 | 16:04 | 16:05 | 16:06 | 16:07 | 16:08 | 16:09 |
Value | 1.5 | 6 | 2.4 | 3.1 | 21 | 9.1 | 2.4 | 1 | 0.7 | 5 |
When dealing with statistics it's often useful to have things sorted in ascending order, as shown in Table 4-4, so while the time window from which you've derived your sample is important for later context, you can throw it out when performing the statistics we're talking about here.
Value | 0.7 | 1 | 1.5 | 2.4 | 2.4 | 3.1 | 5 | 6 | 9.1 | 21 |
The min value of a time series is the minimum value observed, or the lowest value. The max value of a time series is the maximum value observed, or the highest value. These are pretty easily understood ideas, but it's important that you use them when looking at SLI data in order to pick proper SLOs. If you don't have a good understanding of the total scope of possibilities of the measurements you're making, you'll have a hard time picking the right target for what these measurements should be. Looking at Table 4-4, we can see that the min value of our dataset is 0.7 and the max value is 21.
The third M word you need to know is mean. The mean of a dataset is its average value, and the words mean and average are generally interchangeable. A mean, or average, is the value that occurs when you take the sum of all values in your dataset and divide it by the total number of values (known as the cardinality of the set). We can compute the mean for our time series via the following equation:
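(1.5 + 6 + 2.4 + 3.1 + 21 + 9.1 + 2.4 + 1 + 0.7 + 5) / 10 = 52.2 / 10 = 5.22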
Note
There is nothing terribly complicated about computing a mean, but it provides an incredibly useful and simple insight into the performance of your SLI. In this example, we now know that even though we had a min value of 0.7 and a max value of 21, the average during the 10 minutes of data that we're analyzing was 5.22. This kind of data can help you pick better thresholds for your SLOs. Calculating the mean value for a measurement is more reliable than looking at a graph and trying to eyeball what things are "normally" like.
The fourth M word is median. The median is the value that occurs right in the middle. In our case, we are looking at a dataset that contains an even number of values, so there is no exact middle value. The median of the data in situations like this is the mean of the two middle values. In our case this would be the 5th and 6th values, or 2.4 and 3.1, which have a mean of 2.75.
The median gives you a good way to split your data into sections. It'll become clearer why that is useful when we introduce percentiles momentarily, but what should hopefully be immediately clear is that the mean for this data is higher than the median value. This tells you that you have more values below your average than you have above it (in this case, 7 values compared to 3), which lets you know that the higher-value observations happen less frequently, and that they contain outliers. Knowing about outliers can help you think about where to set thresholds in terms of what might constitute a good observation versus a bad one for your service. Sometimes these outliers are perfectly fine in the sense that they don't cause unhappy users, and other times they can be indicative of severe problems, but at all times outliers are worth investigating more to know which category they fit into.
The fifth and final M word is mode. The mode of a dataset is the value that occurs most frequently. In our example dataset the mode is 2.4, because it occurs twice and all the other values occur only once. When no value occurs more than once, there is no mode. When multiple values occur at the same frequency, the dataset is said to be multimodal. The concept of counting the occurrences of values in a sample is very important, but is much better handled via things like frequency distributions and histograms, which are introduced in Chapter 9. The mode is only included here for the sake of completeness in our introduction to statistical terminology.
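Python's standard library computes all five Ms directly; here is a minimal sketch using the sample from Tables 4-3 and 4-4:

```python
import statistics

# The sample from Table 4-3 (order doesn't matter for these calculations).
values = [1.5, 6, 2.4, 3.1, 21, 9.1, 2.4, 1, 0.7, 5]

print(min(values))                           # min:    0.7
print(max(values))                           # max:    21
print(round(statistics.mean(values), 2))     # mean:   5.22
print(round(statistics.median(values), 2))   # median: 2.75 (mean of the two middle values)
print(statistics.mode(values))               # mode:   2.4 (the only value that occurs twice)
```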
Ranges
Another important basic statistical concept is that of a range, which is simply the difference between your max value and your min value; it lets you know how widely distributed your values are. In our sample (Table 4-4), the min value is 0.7 and the max value is 21. The math to compute a range is just simple arithmetic:
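21 - 0.7 = 20.3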
Ranges give you a great starting point in thinking about how varied your data might be. A large range means you have a wide distribution of values; a small range means you have a slim distribution of values. Of course, what wide or slim could mean in your situation will also be entirely dependent on the cardinality of the values you're working with.
While ranges give you a great starting point in thinking about how varied your dataset is, you'll probably be better served by using the concept of deviations. Deviations are more advanced ways of thinking about how your data is distributed; Chapter 9 talks about how to better think about the distribution, or variance, of your data.
Percentiles
A percentile is a simple but powerful concept when it comes to developing an understanding of SLOs and what values they should be set at. When you have a group of observed values, a percentile is a measure that allows you to think about a certain percentage of them. In the simplest terms, it gives you a way of referring to all the values that fall at or below a certain percentage value in a set.
For example, for a given dataset, the 90th percentile will be the threshold at which you know that all values below the percentile are the bottom 90% of your observations, and all values above the percentile are the highest 10% of your observations.
Using our example data from earlier, values falling within the 90th percentile would include every value except the 10th one. When working with percentiles you'll often see abbreviations in the form PX, where X is the percentile in question. Therefore, the 90th percentile will often be referred to as the P90. If you wanted to isolate the bottom 50% of your values, you would be talking about values below the 50th percentile, or the P50 (which also happens to be the median, as discussed previously). While percentiles can be useful at almost any value, depending on your exact data, there are also some common levels at which they are inspected. You will commonly see people analyzing data at levels such as the P50 (the median), P90, P95, P98, and P99, and even the P99.9, P99.99, and P99.999.
When developing SLOs, percentiles serve a few important purposes. The first is that they give you a more meaningful way of isolating outliers than the simpler concept of medians can. While both percentiles and medians split your data into two sets (below and above), percentiles let you set this division at any level. This allows you to look at your data split into many different bifurcations. You can use the same dataset and analyze the P90, P95, and P99 independently. This kind of thinking allows you to address the concept of a long tail, which is where your data skews in magnitude in one direction or the other, but perhaps not with a frequency that is meaningful.
The second way that percentiles are useful in analyzing your data for SLOs is that they can help you pick targets in a very direct manner. For example, let's say that you calculate the P99 value for a month of data about successful database transaction times. Once you know this threshold, you now also know that if you had used it as your SLO target, you would have been reliable 99% of the time over your analyzed time frame. Assuming performance will be similar in the future, you could aim for a 99% target moving forward as well.
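Python's statistics module can compute these cut points directly. Here is a minimal sketch using the sample from Table 4-4 (the printed values assume the "inclusive" interpolation method; other methods will give slightly different results):

```python
import statistics

# The sorted sample from Table 4-4.
values = [0.7, 1, 1.5, 2.4, 2.4, 3.1, 5, 6, 9.1, 21]

# quantiles(n=100) returns the 99 cut points P1 through P99.
cuts = statistics.quantiles(values, n=100, method="inclusive")
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]

print(p50)  # ≈ 2.75  (the median)
print(p90)  # ≈ 10.29
print(p95)  # ≈ 15.65
print(p99)  # ≈ 19.93
```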
Note
Calculating common percentiles based upon your current data is a great starting point in choosing an initial target percentage. Not only do they help you identify outliers that don't reflect the general user experience, but your metrics data will sometimes simply be more useful to analyze with those outliers removed. Another common way to achieve these results and better understand your data involves histograms, which we'll discuss in Chapter 9.
Metric Attributes
A lot of what will go into picking a good SLO will depend on exactly what your metrics themselves actually look like. Just like everything else, your metrics, and therefore your SLIs, will never be perfect. They could be lacking in resolution, because you cannot collect them often enough or because you need to aggregate them across many sources; they could be lacking in quantity, since not every service will have meaningful data to report at all times; or they could be lacking in quality, perhaps because you cannot actually expose the data in the exact way that you wish you could.
Chapter 1 talked about how an SLI that allows you to report good events and total events can inform a percentage. Though this is certainly not an incorrect statement, it doesn't always work that way in the real world, at least not directly. While it might be simplest to think about things in terms of good events over total events, it's not often the case that your metrics actually correlate to this directly. Your "events" in how this math works might not really be "events" at all.
Resolution
One of the most common problems you'll run into when figuring out SLO target percentages revolves around the resolution of your metrics. If you want a percentage informed by good events over total events, what do you do if you only have data about your service that is able to be reported, or collected, every 10, 30, or 60 seconds?
If you're dealing with high-resolution data, this is probably a moot point. Even if the collection period is slow or sporadic, you can likely just count aggregate good and total events and be done with it.
But not all metrics are high resolution, so you might have to think about things in terms of windows of time. For example, let's say you want a target percentage of 99.95%. This means that you're only allowed about 43 seconds of bad time per day:
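(1 - 0.9995) × 24 × 60 × 60 = 0.0005 × 86,400 = 43.2 seconds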
To work this out, you first subtract your target percentage from 1 to get the acceptable percentage of bad observations. Then, to convert this into seconds per day, you multiply that value by 24 (hours per day), then by 60 (minutes per hour), then by 60 again (seconds per minute).
In this case, you would exceed your error budget with just a single bad observation if your data exists at a resolution of 60 seconds, since you would immediately incur 60 seconds of bad time. There are ways around this if these bad observations turn out to be false positives of some sort. For example, maybe you need your metric to be below (or above) a certain threshold for two consecutive observations before you count even just one of them against your percentage. However, this also might just mean that 99.95% is the wrong target for a system with metrics at that resolution. Changing your SLO to 99.9% would give you approximately 86 seconds of error budget per day, meaning you'd need two bad observations to exceed your budget.
Exactly what effect your resolution will have on your choice of reliability target is heavily dependent on both the resolution and the target at play. Additionally, the actual needs of your users will need to be accounted for. While we will not enumerate more examples, because they're potentially endless, make sure you take metric resolution into consideration as you choose a target.
Quantity
Another common problem you could run into revolves around the quantity of your metrics. Even if you are able to collect data about your service at a one-second interval, your service might only have an event to report far less frequently than that. Examples of this include batch scheduler jobs, data pipelines with lengthy processing periods, or even request and response APIs that are infrequently talked to or have strong diurnal patterns.
When you don't have a large number of observations, target percentages can be thrown for a loop very quickly. For example, if your data processing pipeline only completes once per hour, a single failure in a 24-hour period results in a reliability of 95.83% over the course of that single day (23 successful runs out of 24, and 23 ÷ 24 ≈ 0.9583). This might be totally fine; one failure every day could actually be a perfectly acceptable state for your service to be in, and maybe you could just set something like a 95% SLO target to account for this. In situations like this, you'll need to make sure that the time window you care about in terms of alerting, reporting, or even what you display on dashboards is large enough to encompass what your target is. You can no longer be 95% reliable at any given moment in time; you have to think about your service as being 95% reliable over a 24-hour period.
Even then, however, you could run into a problem where two failures over the course of two entire days fall within 24 hours of each other. To allow for this, you either have to make the window you care about very large or set your reliability target down to 90%, or even lower. Any of these options might be suitable. The important thing to remember is that your target needs to result in happy users over time when you're meeting it, and mark a reasonable point to pivot to discussing making things more reliable when you're not.
For something like a request and response API that either has low traffic at all times or a diurnal (or other) pattern that causes it to have low traffic at certain times, you have a few other options in addition to using a large time window.
The first is to only calculate your SLO during certain hours. There is nothing wrong with saying that all time periods outside of working hours (or all time periods outside of 23:00 to 06:00, or whatever makes sense for your situation) simply don't count. You could opt to consider all observations during those times as successes, no matter what the metrics actually tell you, or you could just ignore observations during those times. Either of these approaches will make your SLO target a more reasonable one, but could make an error budget calculation more complicated. Chapter 5 covers how to better handle error budget outliers and calculations in more detail.
The other option available to you is using some advanced probability math like confidence intervals. Yes, that is a complicated-sounding phrase. Yes, they are not always easy to implement. But don't worry, Chapter 9 has a great primer on how this works.
Services with low-frequency or low-quantity metrics can be more difficult to measure, especially in terms of calculating the percentages you use to inform an SLO target, but you can use some of these techniques to help you do so.
Quality
The third attribute that you need to keep in mind for the metrics informing your SLIs and SLOs is quality. You don't have quality data if it's inaccurate, noisy, ill-timed, or badly distributed. You could have large quantities of data available to you at a high resolution, but if this data is frequently known to be of a low quality, you cannot use it to inform a strict target percentage. That doesn't mean you can't use this data; it just means that you might have to take a slightly more relaxed stance.
The first way you can do this is to evaluate your observations over a longer time window before classifying them as good or bad. Perhaps you measure things in a way that requires your data to remain in violation of your threshold in a sustained manner for five or more minutes before you consider it a bad event. This lowers your potential resolution, but does help protect against noisy metrics or those prone to delivering false positives. Additionally, you can use percentages to inform other percentages. For example, perhaps you require 50% of all the metrics to be in a bad state over the five-minute time window before you consider that you have had a single bad event that then counts toward your actual target percentage.
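As a rough sketch of that last idea (the 50% cutoff and the example threshold are hypothetical values, not recommendations), you might collapse noisy per-observation data into a single good-or-bad decision per window:

```python
def window_is_bad(observations, threshold, bad_fraction_required=0.5):
    """Count a five-minute window as one bad event only if at least
    bad_fraction_required of its observations violate the threshold."""
    if not observations:
        return False  # no data in the window; handle however makes sense for you
    bad = sum(1 for value in observations if value > threshold)
    return bad / len(observations) >= bad_fraction_required

# Example: only 2 of 5 observations are bad, so the window still counts as good.
print(window_is_bad([120, 150, 900, 2500, 130], threshold=800))  # False
```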
Note
Using the techniques we've described here, as well as those covered in Chapters 7 and 9, you can even make low-quality data work for you, though you might need to set targets lower than how your service actually performs for your users. As long as your targets still ensure that users are happy when you exceed them and aren't upset until you miss them by too much for too long, they're totally reasonable. It doesn't matter what the percentage actually is. Don't let anyone trick you into thinking that you have to be close to 100% in your measurements. It's simply not always the case.
Percentile Thresholds
When choosing a good SLO, it's also important to think about applying percentile thresholds to your data. It's rarely true that the metrics that you're able to collect can tell you enough of the story directly. You'll often have to perform some math on them at some point.
The most common example of this is using percentiles to deal with value distributions that might skew in one direction or the other in comparison with the mean. This kind of thing is very common with latency metrics (although not exclusive to them!), when you're looking at API responses, database queries, web page load times, and so on. You'll often find that the P95 of your latency measurements has a small range while everything above the P95 has a very large range. When this happens, you may not be able to rely on some of the techniques we've outlined to pick a reliable SLO target.
Let's consider the example of web page load time again, since it's a very easy thought experiment everyone should be familiar with. You're in charge of a website, and you've done the research and determined that a 2,000 ms load time keeps your users happy, so that's what you want to measure. But once you have metrics to measure this, you notice that your web pages take longer than 2,000 ms to load a full 5% of the time, even though you haven't been hearing any complaints. You could just set your target at 95% (and this might not be the wrong move), but you gain a few advantages by using percentiles instead. That is, you could say that you want web pages to load within 2,000 ms at the 95th percentile, 99.9% of the time.
The primary advantage of this approach is that you can continue to monitor and care about what your long tail actually looks like. For example, let's say that your P95 observed values generally fall below 2,000 ms, your P98 values fall below 2,500 ms, and your P99 values fall below 4,000 ms. When 95% of page loads complete within 2 seconds, your users may not care if an additional 4% of them take 4 seconds; they may not even care if 1% of them take 10 seconds or time out entirely. But what users will care about is if suddenly a full 5% of your responses start taking 10 seconds or more.
Instead of just caring about the bottom 95% of your latency SLI by setting an SLO target of 95%, caring about a high percentage of your P95 completing quickly enough frees you up to look at your other percentiles. Based on the preceding examples, you could now set additional SLOs that make sure your P98 remains below 2,500 ms and your P99 remains below 4,000 ms. Now you have three targets that help you tell a more complete story, while also allowing you to notice problems within any of those ranges independently, instead of just discarding some percentage of the data:
The P95 of all requests will successfully complete within 2,000 ms 99.9% of the time.
The P98 of all requests will successfully complete within 2,500 ms 99.9% of the time.
The P99 of all requests will successfully complete within 4,000 ms 99.9% of the time.
With this approach, you'll be able to monitor if things above the 95th percentile start to change in ways that make your users unhappy. If you try to address your long tail by setting a 95% target, you're discarding the top 5% of your observations and won't be able to discover new problems there.
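A minimal sketch of how the three targets above might be evaluated for a single window of latency observations (the function and variable names are hypothetical):

```python
import statistics

# Percentile thresholds from the SLOs above: (percentile, threshold in ms).
PERCENTILE_THRESHOLDS = [(95, 2000.0), (98, 2500.0), (99, 4000.0)]

def window_results(latencies_ms):
    """For one window of request latencies, return whether each
    percentile-threshold SLI counts this window as a good event."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {p: cuts[p - 1] <= threshold for p, threshold in PERCENTILE_THRESHOLDS}

# Each window's good/bad results then feed the 99.9% targets over time.
```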
Another advantage of using percentiles is a reporting one. This book preaches an idea that may have been best summarized by Charity Majors: "Nines don't matter if users aren't happy." While this is true, a long tail accommodated by a lower target percentage can be misleading to those newer to SLO-based approaches. Instead, you can use percentiles to make your written SLO read more intuitively as a "good" one. You shouldn't set out to purposely mislead anyone, of course, but you can always choose your language carefully so as to not alarm people.
What to Do Without a History
What should you do if you're trying to set SLOs for a service that doesn't yet exist, or otherwise doesn't have historical metrics for you to analyze? How can you set a reasonable target if you don't yet have any users, and therefore may not know what an acceptable failure percentage might look like for them? The honest answer is: just take an educated guess!
SLOs are objectives, not formal agreements, and that means you can change them as needed. While the best way to pick an SLO that might hold true into the future is to base it on data you've gathered, not every SLO has to hold true into the future. In fact, SLO targets should change and evolve over time. Chapter 14 covers this in depth.
There may be other sources of data you can draw upon when making your educated guess: for example, the established and trusted SLOs of services that yours might depend on, or ones that will end up depending on yours. As you've seen, your service can't be more reliable than something it has a hard dependency on, so you need to take that into account when picking an initial target.
It's also true that not every service has to have an SLO at launch, or even at all. An SLO-based approach to reliability is a way of thinking: are you doing what you need to do, and are you doing so often enough? It's about generating data to help you ask those questions in a better way. If your service doesn't yet have the metrics or data you need to be able to ask those questions in a mathematical way, you can always make sure you're thinking about these things even without that data.
However, there are ways to make sure you're thinking about SLIs and SLOs as you architect a service from the ground up. Chapter 10 discusses some of these techniques.
Summary
Every system will fail at some point. That's just how complex systems work. People are actually aware of this, and often okay with it, even if it doesn't always seem obvious at first. Embrace this. You can take actions to ensure you don't fail too often, but there isn't anything you can do to prevent every failure for all time, especially when it comes to computer systems.
But if failures don't always matter, how do you determine when they do matter? How do you know if you're being reliable enough? This is where choosing good SLOs comes in. If you have a meaningful SLI, you can use it to power a good SLO. This can give you valuable insight into how reliable your users think you are, as well as provide better alignment about what "reliable" actually means across departments and stakeholders. You can use the techniques in this chapter to help you start on your journey to doing exactly that.
1 Think, for example, about Hyrum's law, discussed in Chapter 2. There's also a great story about Chubby, a dependency for many services at Google, in Chapter 5.
2 Trying to be 99.999% reliable over time means you can be operating unreliably for less than one second per day and only about 5 minutes and 15 seconds over the course of an entire year (Chapter 5 discusses how to do this math). This is an incredibly difficult target to reach. Even if your services are rock solid, everything depends on something else, and it is often unlikely that all of those dependencies will also consistently operate at 99.999% for extended periods of time.
3 These numbers were calculated via an excellent tool written by Kirill Miazine and are based upon calculations that assume a year has 365.25 days in order to account for leap years.
4 Actually identifying all of your dependencies, however, is a complicated ordeal. This is why you need to measure the actual performance of your service, and not just set your targets based upon what your known dependencies have published. You almost certainly have dependencies you don't know about, and your known dependencies aren't all going to have perfectly defined SLOs.
5 Chapter 14 discusses how to evolve your SLO targets in great detail.