Chapter 4. Measure and Experiment

Up to this point our model has consisted of deploy, release, and operate. For the final addition to our continuous operations model, we need to measure the impacts of our changes through data collection and run experiments against that data to validate whether those changes yielded the results we expected or desired. A robust measurement and experimentation practice yields treasure troves of data that can inform better product direction, reduce the risk of degradations in service, or even give an early warning sign of misaligned feature development.

In Chapter 3, we briefly touched on monitoring and observability solutions. We discussed that application monitoring tools often rely on reactive alerting techniques to inform operators about changes to application and platform performance. In contrast, the measure and experiment stage focuses on quantifying the efficacy of the features you are releasing to end users more proactively. As teams progress along their continuous operations and feature management journey, understanding the metrics that support a successful feature release—and, in some cases, support rolling back a feature—becomes increasingly important. These measurements have a direct relationship to the features you are releasing separately from your software deployment. It’s necessary to shift measure and experiment practices closer to the release stage (see Figure 4-1).

Figure 4-1. The measure and experiment stage

But why is measure a better solution for this stage than monitor? What should be measured, and why? What are the different types of experimentation, who should deploy them, and when should they be used? Answering these questions helps create a better understanding of how users are experiencing your service, and each answer informs what should be built next. These questions, and ultimately the practices that answer them, are especially valuable to members of the software delivery team such as product managers, who consistently find themselves asking other questions, such as:

  • Does this change negatively impact a performance metric?

  • Is this feature “safe” to release to all users?

  • Will this change help us achieve our target key performance indicators (KPIs)—e.g., cart size, sign-up rate, user click-through?

You’ll note that each of these questions ties very closely to the experience of users in their utilization of the application itself. Monitoring practices are great at measuring performance-based application KPIs, such as latency or connection rate, but can’t always address action-based application KPIs, such as sign-up rates or click-throughs. These action-based metrics are ones that teams want to measure, to help them understand whether revisions to their software are resulting in improvements, no change, or negative impacts.

Another advantage of measure and experiment processes is that they help remove aspects of release bias that may exist within a given feature release. As an example, in early stages of feature development, it’s easy to fall into the trap of assuming that “version 2” of your feature is a direct improvement on “version 1.” However, as you reduce your batch (code push) size and increase your deployment frequency, adjustments and iterations will also occur on existing components and capabilities within your application. As a result, it’s fair to assume that at a high deployment frequency, not every design decision is going to result in positive changes to the application. In other words, version 2 may not be better than version 1. Measuring and experimentation practices reduce the risk of this bias occurring by providing teams with data and quantifiable metrics that measure against predefined goals or KPIs.

Now that we’ve defined where measure and experiment fit into the continuous operations framework, we can start to explore the what, why, and how a bit more. In this chapter, we’ll detail why measuring is a better practice for tracking/interpreting data and discuss different forms of measurement and experimentation such as feature validation, risk mitigation, and optimization.

Measure

Measure is the act of identifying, tracking, and learning from activities and events that occur within your application or from user interactions with your application. Measure includes both understanding key success metrics across release processes and experimenting with variations to understand the desired outcomes.

Why We Need to Measure Data

Measuring is trying to solve for one thing: “Let’s see what happened, bad or good.” Either type of outcome is useful for future product decision making or risk mitigation. Measurement reflects the curiosity for validation that drives disciplines such as product management, software engineering, product design, and more. As we’ll see later in this chapter, establishing a strong measurement framework and data collection strategy is a foundational component of experimentation.

Learning what worked well is a key component driving future decisions and requires that we have accurate information for our findings. Measuring something that results in greater interactions with parts of your application, or greater consumption of a newly implemented service, invites further investigation and analysis. If this new release resulted in greater engagement from users, what else can we do across the service to have a similar effect? As a contrast, monitoring might have us saying, “I’m not seeing any errors,” but measuring lets us say, “Whoa, we’ve got a winner here, check this out.”

What Impact Does Measure Have?

The bottom line is: understanding what increases a desirable metric or decreases an undesirable one is valuable. Perhaps adopting pagination in one part of the product increased user session length, but adding a new menu interface created confusion and increased page exits. Maybe adding pagination in other areas of the product would create additional engagement, and we should revert to our old menus. Either way, it’s going to be worth it to measure further.

If we didn’t adopt a measurement process and instead relied on a monitoring solution, we might miss some of these insights. Our monitoring tools’ response to these changes might just be “Everything looks good, nothing broke,” and then we as a product team gain no insights into how well our changes have been received.

Shifting from Measure to Experiment

Up to this point we’ve talked about the benefits of effective measuring capabilities, and we’ve touched on the downstream impacts that data can have. However, these impacts come not only from collecting data but also from running experiments. Experiments take the data from proactively measurable events, assess how those events affect an outcome against a hypothesis, and provide insights for possible future decisions.

Experimentation comes in many different flavors, use cases, and best practices. In the next section, we will cover different types of experiments, when to use each one, which roles should be involved, why they should be utilized, and what should be gleaned. We will also discuss when monitoring, as opposed to experimentation, is the best option for specific scenarios.

Experimentation

An experiment always starts with a question and a hypothesis. The question could be something like “Which sign-up flow results in the most completions?” or “What search algorithm creates the most bookings?” Once a question has been established, we can form a hypothesis based on past experiences, previous data, or any number of influences. At its core, experimentation requires three essential elements: a randomized control trial, metrics, and statistics. The randomization gives you unbiased causal inference, and the statistics help you understand whether any observed difference is meaningful.
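
As a minimal illustration of the randomization piece, the sketch below assigns users to a control or treatment arm in a “sticky” way (the same user always sees the same arm) while keeping assignment across users unbiased. The experiment key signup-flow-test is a hypothetical name; in practice an experimentation platform handles this for you.

```python
import random

def assign_variant(user_id: str, variants=("control", "treatment")) -> str:
    """Randomly assign a user to one arm of the trial.

    Seeding a per-user RNG with the experiment key plus the user ID makes
    the assignment sticky for that user while remaining effectively
    random, and therefore unbiased, across the whole population.
    """
    rng = random.Random(f"signup-flow-test:{user_id}")
    return rng.choice(variants)
```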

New Software Delivery Enables Experimentation

Fortunately, the evolution of software deployments has made it easier for organizations to run experiments. Back when software used to come on a CD, running experiments was difficult. Maybe you could have run an experiment by shipping different versions to different markets and soliciting feedback from users. This might have yielded some useful info for the next iteration of your software, but overall the feedback cycle was very slow.

As software delivery has evolved to become more digital and less reliant on physical copies, feedback cycles have gotten shorter, and experiments have become easier to design and run. Experiments can be run on individual features (through their implemented feature flags), deployment versions, or geographies/user groups. Results can be measured in real time, allowing for quick iteration based on the outcomes. Experimentation allows organizations to base decisions on quantifiable data from user behavior rather than on instinct, gut feelings, or prior biases.

But is experimentation really necessary? Organizations that lack a culture of experimentation are in danger of making suboptimal decisions and missing out on learning from actual usage. They risk spending development cycles on assumption-driven iteration that isn’t informed by real user interaction data. Decisions are driven by the loudest voice or the “HiPPO” (highest-paid person’s opinion) rather than by outcomes. In the worst case, a release without any experimentation can result in a user dissatisfaction rollback, in which a feature harms the user experience so much that it has to be pulled from production. Let’s walk through the structure of a feature validation experiment and where the product team fits into this type of experiment.

Feature Validation Experiments

Feature validation experiments let teams test features with a subset of users to determine whether that feature provides value or needs to be reconsidered. This type of experiment should happen during the release stage and can be run by product managers and/or developers.

The purpose of a feature validation experiment is to gather data from user interactions and then use that data to determine the benefits of the feature and inform future product roadmap discussions. If the new feature improves the user experience, the data will demonstrate that. If not, the insights gathered from the experiment should prove valuable for future planning. Ultimately, these experiments act as a feedback loop from user to organization and can validate whether expected user behavior aligns with actual user behavior. Understanding what users actually do when interacting with your product makes it easier to build the features that support them.

Typically it’s the product managers who control feature validation experiments. Product managers are tasked with managing the production and implementation of a new feature and are measured on shipping features that impact users. During roadmap planning they are responsible for explaining why a proposed feature will have the desired impact, and after release they must demonstrate that it actually did. It’s a position that relies heavily on customer data and research.

In the legacy model of software deployments, product managers were forced to rely on qualitative research data from both industry and customer sources until a new feature was fully launched. It was only after the release that product managers were able to gather quantitative data on how the feature performed in the real world. Unfortunately, this came with its own set of challenges. Once a feature was broadly deployed, product managers were tasked with finding groups they could survey on their experience while also managing the fact that the entire user base was now consuming the same features.

As the software delivery model has moved forward, feature validation experiments can now be used to help product managers get new features into select users’ hands faster and in a more controlled manner, providing them with data that can be used to iterate and release features with more confidence.

So how should product managers get started? The first step is to build a minimum viable product (MVP) of a new feature. The product manager should identify the smallest useful unit the feature can be boiled down to and build that specific configuration. Once it’s complete, the MVP should be released to a small subset of users. The product manager then compares the behavior of users who received the new variant against the behavior of the control group. If the variant outperforms the control against the metrics of success (more on this later), the product manager knows with quantifiable confidence that the new feature is worth developing further.
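
Here is a minimal sketch of what that limited rollout can look like under the hood, assuming a hypothetical flag key (new-checkout-mvp) and a 5% rollout; a feature management platform would normally do this bucketing for you.

```python
import hashlib

def in_mvp_rollout(user_id: str, rollout_percent: float = 5.0,
                   flag_key: str = "new-checkout-mvp") -> bool:
    """Deterministically bucket a user into the MVP cohort.

    Hashing the flag key plus the user ID gives a stable bucket in
    [0, 100); users below the rollout percentage see the MVP variant,
    everyone else stays on the control experience.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100  # 0.00-99.99
    return bucket < rollout_percent
```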

Incentive-Based Development

As is the case with any experiment, the variant doesn’t always win. Sometimes the control group performs better than the new variant against a set of prescribed metrics. Many product teams may consider this a “bad” result. The consensus on the team might be, “Users didn’t like that thing we built, and we wasted cycles on development!”

This is where something potentially troubling can happen. The definition of success can shift to match a more favorable outcome. What was initially viewed as unsuccessful may be reinterpreted as supporting a team’s goals. This redefining of success is typically caused by a misalignment of incentives.

In some organizations, members of software delivery teams are incentivized by the quantity of shipped features, not by the magnitude of impact. Internally, shipping a big feature creates lots of fanfare among peer groups, attention from leadership, and emoji responses in the team Slack. Externally, customers notice the new feature, press releases go out touting the achievements and impact on the company, and so on. All this coverage and visibility of the release makes sense. After all, the goal of building new software is ultimately to release it to user communities, and the more features that are shipped, the greater the acclaim.

The problem with incentivizing development teams on quantity, not quality, is that there are times when an enhancement they developed has little impact on metrics that matter. Or worse, the developed item has a negative impact when compared to the current user experience. Releasing features that create a negative user experience results not only in wasted time and money (in the form of salaries) but also in lowered morale and confidence in the product’s vision. No one wants to spend time working on projects that are perceived to not matter or to create adverse effects. Today, developers have a lot of choice in where they work thanks to an asymmetric labor market. Spending months working on something that lands with a thud and is quietly pulled two months later doesn’t help with employee retention.

Teams that are tied to these types of incentive-based structures may try to “massage” the success metrics when confronted with data that shows that the control outperforms the variant. Perhaps the poor results were from a small sample size? Or maybe an unrepresentative test group? Maybe an outlier skewed the data, or the results don’t align with customer interviews. In any of these cases, there could be a number of legitimate reasons that contributed to the results being worse than expected. The point here is that organizations should be careful in how they measure the success of feature shipments and should ensure that the quality of each release is being considered along with the quantity of features being released.

The presence of a healthy experimentation culture reduces the “ship or die” incentives that can color decision making. Evaluating product managers on customer impact removes the incentive to explain away bad data during MVPs.

Pitfalls of Changing Success Metrics

We all have had experiences or can likely point to examples that highlight a time when precious resources were wasted on features or products that didn’t work out. One such example is from a well-known social media company.

A New Product in a New Market

Facing competition from a couple of high-growth upstarts, this social media company devoted an enormous amount of resources (developer time, product managers, opportunity cost of delaying existing roadmap features, etc.) to building a new product to keep up with its new competitors. When the project reached MVP stage, the company tested it in an international, non-English-speaking market that typically didn’t draw the attention of US-based tech journalists.

To define success, the company decided to use revenue and new engagement (time spent on the feature) as the primary success metrics. If people clicked on ads and engaged with the platform through this new product, then development of the product could and should be continued.

Considerable resources were amassed and deployed, and the product was released to the test country. When the results came in, they showed that engagement on the new product was high, but revenue and engagement on the entire platform were both flat. The company was confronted with a tricky question: should it count the high product engagement as a positive enough signal that the project was worth pursuing? Or should it more heavily weigh the flat revenue and overall platform engagement metrics and refocus resources elsewhere?

Since the success metrics were new engagement and revenue, victory could plausibly be declared. Look at all this engagement on the new product! The people tasked with bringing the product to market had a powerful incentive to forge ahead. Their career prospects and internal reputation would grow if they had a major hand in developing a high-profile, impactful product.

These incentives caused them to either ignore or willfully diminish a serious problem: users were spending more time on the new product, but overall platform usage and revenue were flat. The revenue and engagement gains weren’t “new”; they were cannibalized from existing products. People didn’t spend more time on the company’s products overall; they just shifted from the standard version to the new, novel product.

Data scientists at the company raised the alarm that the new product’s engagement growth masked a lack of growth overall. If users toyed around with the new product at the expense of the rest of the platform, then building the new product wouldn’t be worth the effort. Why invest so many resources to end up at the same engagement levels as the existing product?

Up to this point, everything in the story can reasonably be filed away in the “sometimes you’re right, sometimes you’re wrong” category. The fact that the company even tested the new product in a geographically isolated market demonstrates better-than-average operational capability.

The critical error came not in the proposal of the idea, the building of the MVP, or the limited product release. The wrong turn happened when the product team, eager to justify continued investment in the project, changed the success metrics.

Facing the Results

The product team leading this effort probably made reasonable arguments based on its own interpretations of the results. As often happens, though, personal investment in the project may have led to a biased opinion about the viability of the new product. These well-intentioned individuals wanted their project to continue, and abandoning it felt like failure. These types of influences can alter how product teams interpret the information used to make critical product decisions.

At the start of the experiment, the team defined success as meeting certain thresholds of product and platform engagement. When the results came back showing high levels of product engagement but flat levels of broader platform engagement, the team de-emphasized the platform engagement metrics in favor of the product engagement.

As a result of this revised success definition, the new product received continued investment. Developers, designers, user researchers, and product managers spent more time on the project, and it was eventually launched worldwide to great fanfare. Unfortunately, customer response was tepid. Just as the experiment data showed, the new product didn’t bring in new users and instead drew most of its audience from the existing user base. Despite further investment after the troubling initial results, the product was eventually abandoned.

Overall, this was a pretty bad outcome for the company. Not only did it build something its customers didn’t want and didn’t use, but there was a large opportunity cost and lost resources as well. The time spent building this ill-fated product could have been better spent on other ideas or on improving other areas of the existing platform.

However, what shouldn’t be lost here is that the company’s idea to run an initial experimental pretest of the product was the right approach. It was the post-test product efforts and misinterpreted or misrepresented data that organizations should strive to avoid. Making incorrect guesses about user behavior is inevitable; misallocating resources after tests indicate a likely outcome is not.

An important lesson to be learned from this episode is that experimentation data is just that—data. How that data is interpreted and applied to decision making is still up to the product team, developers, and organizational leaders involved in product decisions. It’s important to set success metrics that are agreed on by all the stakeholders who could be impacted by the decision to move forward with a particular project, not just by those directly involved in its implementation.

Culture and Feature Abandonment

In Chapter 3, we discussed how creating a culture of accountability rather than a culture of blame will improve morale, increase employee retention, and reduce incident response times. Not punishing an engineer for creating or discovering a bug makes it more likely that the next engineer will be willing to come forward with a potential bug before it becomes too problematic. Product managers and other members of the software delivery team who are responsible for shipping and releasing features should similarly be encouraged to recognize an underperforming variant and should feel safe abandoning the feature or project.

Just as you’d rather have a high frequency of false-positive incident alerts than miss a real incident, you’d also rather have product managers move on from ideas that don’t show early indicators of success against their control group. A common principle in the technology community is that it’s better to fail fast, iterate, and improve than to lose time trying to make something perfect the first time. If individuals feel that their career, the perception of their work, or even their continued employment is put at risk by abandoning an underperforming project, the likelihood of shifting the goalposts for success grows.

While encouraging this sort of trial-and-error methodology may result in fewer features being shipped more broadly when compared to an organization with more traditional shipping incentives, teams should see an increase in release impact and effectiveness. In turn, this should directly translate to a greater rate of overall success for both the employees and the product. Establishing a culture of release effectiveness and success encourages product managers and teams to gracefully kill projects to maintain higher overall product quality.

In organizations that embrace this mentality, the only unshipped features are ones that should never be shipped in the first place. Not pursuing the features that fail early tests saves time and money in the long run. No organization wants to deliver features that users don’t want or won’t use. The best case in such a scenario is that the new feature benefits only a small percentage of users, adding value for some users and cluttering up the menus and screens for others. The worst case is that the problematic features need to be completely removed before they frustrate a large percentage of users and slow down engineering and documentation teams that need to keep tabs on them. Shipping only those features that measure as successful enables organizations to free up developers and product teams to focus on features or aspects of a product that have the highest impact. This could mean innovating and implementing new ideas or removing stale features that lead to detrimental technical debt or product bloat. Ultimately, encouraging product teams to trust the data in front of them creates greater agility.

Cheap Features

In Chapter 3, we talked about the goal of making incidents cheap and easy to declare so that they become commonplace, low-level, run-of-the-mill events that don’t cause panic or problems. A high volume of cheap incidents is less costly than a low volume of expensive, high-impact incidents. With feature development, the same idea applies.

Features should have a similar cheapness to them. The likelihood that people in an organization consistently know exactly what their users want is pretty low. Gathering data through direct user interactions, surveys and interviews, or an experimentation platform that is tightly coupled with your feature management platform lets teams easily run experiments to validate their ideas and helps lower the cost of features. If features themselves are viewed as cheap, product managers will feel more comfortable abandoning the ones that underperform against control groups, and they can prioritize organizational effort on higher-impact features.

While the idea of extending the cheapness metaphor to features is new to DevOps, the idea of not becoming too attached to things is not. “Pets versus cattle” is a useful analogy that has been around since the early 2010s. It describes how servers shouldn’t be individually indispensable, like a beloved pet, but instead should be anonymous and easy to replace, like cows in a herd. Rather than relying on servers whose loss would cause enormous problems, teams should design servers for failure and build them with automated tooling. If one or more servers fail, they should be unceremoniously taken offline and replaced. Having automation tools spin up and down an array of servers protects an organization from an overreliance on any one individual server.

Features will never become as identical, anonymous, and easy-come-easy-go as servers, but the “pets versus cattle” analogy still has value at the feature validation stage. Becoming less attached to a specific feature idea makes it easier for organizations to make logical, rational decisions on whether to continue or end their investment.

A/B/n Testing

We’ve talked a lot about experimentation, and you might be trying to fully grasp what these experiments actually look like in the wild. One of the most common experiments is A/B/n testing. As a quick refresher for those unfamiliar with it, A/B/n testing is the process of presenting end users with different variations and then assessing the efficacy of those variations based on some collected metrics. As an example, let’s say that you want to see which variation of a menu bar gets the most user engagement on your website. You might create two or three different variations, randomize which one gets rendered when a user visits the page, and then collect engagement data, such as clicks, to determine which variation is best for your site. The larger the test group, the more confidence you can have about the results of your test.
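
A hedged sketch of the measurement side of such a test, tallying impressions and clicks per menu-bar variation so click-through rates can be compared; in practice your experimentation platform would persist and aggregate these events rather than holding them in memory.

```python
from collections import defaultdict

impressions = defaultdict(int)  # how many times each variation was rendered
clicks = defaultdict(int)       # how many times each variation was clicked

def record_impression(variation: str) -> None:
    impressions[variation] += 1

def record_click(variation: str) -> None:
    clicks[variation] += 1

def click_through_rates() -> dict:
    """Click-through rate per variation, e.g. {'menu_a': 0.041, 'menu_b': 0.057}."""
    return {v: clicks[v] / impressions[v] for v in impressions if impressions[v]}
```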

Designing a Test

The first step in conducting an A/B/n test is to define the success criteria. These should be kept consistent before, during, and after the experiment. Doing so will reduce the chances that subjectivity colors the post-experiment analysis. Because of this, choosing the right criteria can be difficult and/or contentious depending on the stakeholders who will be impacted by the results. That said, most organizations typically have a small set of metrics that matter to them and could make for a good starting point; these include annual recurring revenue (ARR), monthly active users (MAUs), churn, customer acquisition cost (CAC), engagement, uptime, and cost of goods sold (COGS).

Another good practice is to keep the experiment simple. Choosing two or more metrics may sound enticing, but it also introduces complexity. Say, for instance, that you pick ARR and engagement as your metrics. An experiment on a new pricing algorithm might show an increase in ARR while also resulting in decreased engagement. Was that experiment a success? Deciding the answers to such questions after the experiment is completed lets prior narratives and incentives creep into the conversation. Being thoughtful about success criteria from the outset minimizes (but doesn’t eliminate) future complexity.

The next step is to identify a test group. Test groups can be completely randomized across all users or specified based on certain user attributes, such as geographic area, user ID, or device type. You want the sample size to be large enough that results are statistically significant and can’t be dismissed because of outliers or edge cases.
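
To put a rough number on “statistically significant,” here is a sketch of the standard two-proportion sample-size approximation; the 10% baseline conversion rate and the 2-point minimum detectable effect in the example are illustrative assumptions.

```python
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough sample size per variation for a conversion-rate test.

    baseline: current conversion rate (e.g., 0.10 for 10%)
    mde:      minimum detectable effect, absolute (e.g., 0.02 for +2 points)
    Uses the standard two-proportion normal approximation.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(((z_alpha + z_beta) ** 2) * variance / (mde ** 2)) + 1

# Detecting a 2-point lift over a 10% baseline works out to roughly 3,800 users per arm.
```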

Running the Test

When it comes to running an A/B/n test, the when is just as important as the who. You want the experiment to last long enough to gather enough data to make confident decisions, but it should be short enough that it won’t impede your development velocity. The experiment can be run for a set time interval or until a predetermined amount of data is collected.

As you can see, a lot goes into designing and implementing an experiment. You need an experimentation platform that offers randomization, sound statistical analysis, and tight integration with your deployment and release processes. As we mentioned earlier in this chapter, experimentation has three essential components: randomization, metrics, and statistics. You want to ensure that your platform randomizes properly based on the parameters you defined. If it doesn’t, you’ll be following the old adage “Garbage in, garbage out.” You also want to make sure the statistics you use allow you to make good decisions. A big part of good statistics is the design of the experiment. A platform that encourages and enforces good experimental design goes a long way toward providing statistical validity.
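
For the statistics piece, here is a minimal sketch of a two-proportion z-test, one common way to judge whether an observed difference in conversion rates is meaningful; most experimentation platforms compute something like this (often with more sophisticated methods) for you.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates.

    A small p-value (commonly < 0.05) suggests the difference between
    variant and control is unlikely to be random noise.
    """
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```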

Choosing the Right Experimentation Platform

How an experimentation platform interacts with your software deployment process is also important. If the experimentation platform is too far removed from your deployment process, you introduce risk: the winning variant may not be implemented correctly when it is actually deployed, or the audience targeting you defined for the experiment may not map cleanly onto the tools outside the experimentation platform.

With well-defined success criteria, the right test groups, a clear testing interval, and a strong experimentation platform, you can run experiments and base product decisions on data rather than on opinions. Not to be overlooked: if your organization previously relied on quantity-driven incentives, the cultural switch can be a challenge, largely because feature-shipping activity is perceived as a measure of success. The results of this new methodology, however, will be happier users and fewer surprises down the road. A/B/n testing lets you test different versions side by side, validating a new product idea before you invest more resources in it.

Risk Mitigation

Software development is inherently full of risk. Problems can arise from any vector: your developers, your users, third-party SaaS vendors, open source libraries, and so on. Minimizing this risk is crucial to reducing incident severity and customer impact. In Chapter 3 we talked about the importance of risk mitigation as it relates to operations, but lots of actions related to this form of risk mitigation fall outside the scope of this chapter—for example, fostering a good culture so employees don’t leave unexpectedly, buying software versus building it, or adding extra redundancy into systems. For the purposes of this chapter, we’re going to focus on how experimentation can be used as a tool for risk mitigation.

Defining Risk Mitigation

Risk mitigation happens at all stages of the SDLC, and people in different roles take on this responsibility. The idea of risk mitigation usually brings to mind insurance: something that costs a little bit of money now in exchange for avoiding a large expenditure later. This is a useful way to think about risk mitigation. It also allows you to quantify the risk so you can decide whether the mitigation effort makes sense.

From the operational perspective, risk mitigation often comes in the form of controlled rollouts at the release stage; examples include techniques such as blue/green and canary deployments. These allow quick remediation if deployments don’t perform as expected. Using these techniques, developers, product managers, DevOps engineers, and marketers all have the ability to stop a canary or revert to the blue variant when issues arise. Later in the deployment lifecycle, during the operate and monitor phases, risk mitigation comes in the form of providing site reliability engineers (SREs) with the ability to identify and turn off features or systems that are negatively impacting users.
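
As an illustration of what an automated guardrail around a canary rollout can look like, here is a hedged sketch; set_traffic_split, error_rate_for_canary, and roll_back are placeholders for whatever your deployment and monitoring tooling actually exposes.

```python
import time

CANARY_STEPS = [1, 5, 25, 50, 100]   # percent of traffic sent to the new version
MAX_ERROR_RATE = 0.01                # guardrail: abort if the canary exceeds 1% errors

def run_canary(set_traffic_split, error_rate_for_canary, roll_back) -> bool:
    """Gradually shift traffic to the canary, backing out if the guardrail trips."""
    for percent in CANARY_STEPS:
        set_traffic_split(canary_percent=percent)
        time.sleep(300)                       # let metrics accumulate at this step
        if error_rate_for_canary() > MAX_ERROR_RATE:
            roll_back()                       # shift all traffic back to the known-good version
            return False
    return True                               # canary now serves 100% of traffic
```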

Each of these practices (blue/green deploys, canary deploys and rollouts, feature flagging) can be thought of as an experiment, since each provides a safer fallback alternative. So how does experimentation fit in? Earlier we discussed how experimentation can be used to determine whether a feature provides its hypothesized value to users. Mitigating the risk of releasing a broken feature provides value to users too. So in addition to assessing features, developers can use experimentation to determine feature viability and identify potential breaking points prior to a broader release. In other words, experimentation allows them to mitigate the risk of a failed deployment.

Measuring Risk

The first step in mitigating risk is being able to effectively measure the level of risk for a given release. This means defining what risk means to your organization. In B2B companies, for example, risk may be quantified by calculating the value of deals with breachable SLAs or by adding up spikes in monthly vendor costs caused by incidents. At B2C companies, understanding how much revenue could be lost per unit of time the service is inoperable gives a fairly clear quantification of risk.
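
As a back-of-the-envelope illustration of the B2C case, with purely made-up numbers:

```python
# Hypothetical figures for quantifying downtime risk in dollars.
revenue_per_minute = 2_500           # average checkout revenue per minute
expected_downtime_minutes = 45       # historical mean time to restore after a bad release
incidents_per_quarter = 2            # how often a bad release slips through

quarterly_revenue_at_risk = (
    revenue_per_minute * expected_downtime_minutes * incidents_per_quarter
)
print(f"Estimated revenue at risk per quarter: ${quarterly_revenue_at_risk:,}")  # $225,000
```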

These methods of quantifying risk are pretty straightforward since they can be directly correlated to impacts to the business itself. Some risk isn’t as easy to quantify because it may be less tangible or have less impact on measurable metrics. That said, mitigation has a demonstrable utility for decreasing measurable risk, as well as other ancillary benefits. Increased user trust, reduced anxiety around deployments and releases, and decreased time spent on incidents may not be as easy to quantify, but they certainly have positive impacts on an organization.

Trust is difficult to measure, but it’s necessary to the success of any product. Users need to trust that your service will do what it says on the box and not create unexpected downtime or degraded service. If users feel that your service creates extra headaches or additional friction, they will look for alternatives. On the flip side, users who trust your service will become champions of your platform. They might suggest new features, evangelize to others, offer feedback, or provide content that helps drive adoption up.

Like trust, reduced deployment or release anxiety is hard to quantify but provides immense value to an organization. If developers or product managers are nervous that a bad deployment or release will impact customers, create incidents, or draw angry attention from others in the organization, those employees will become hesitant to initiate deployment or release activities. Velocity will start to slow down as developers face more barriers and ship less. At some point, they may look for other jobs that allow them to ship faster.

Conversely, if you have a comparatively relaxed deploy or release cycle, positive feedback loops take hold. Employees who aren’t scared of blowback will ship faster. This results in happier employees who recruit their networks to join you, decreasing recruiting costs and helping you staff up to meet roadmap goals more quickly. Relaxed employees can also enjoy their time off more fully, knowing that if something does go wrong, the on-call team will be able to handle it and they won’t be called back to work.

We covered the benefits of decreased incident severity in Chapter 3. Having a culture of high-volume, low-severity incidents means fewer people pulled out of sprints, higher morale, and less distraction from priorities.

Experimentation for Risk Mitigation

So we’ve touched on the reasons that measuring and mitigating risk improves an organization; now we need to discuss the how. How can we use experimentation to mitigate risk? Experimentation doesn’t have to mean only frontend, “pink or blue sign-up box” decisions. Experiments designed with risk mitigation in mind can surface problems before they reach the full user base. A cloud migration, for example, is a backend use case that is invisible to most end users but carries a great deal of risk. When undertaking something like a cloud migration, experimentation techniques help ensure that the risks are addressed and that users won’t suffer if something goes wrong during the migration.

Migrating systems in any capacity carries substantial risk, whether for larger organizations transitioning from large, on-premises legacy applications to cloud infrastructure or for smaller companies changing database systems. Data loss, performance degradation, and downtime are all major concerns. As a result, changes like these don’t happen overnight; they often come as part of a large, organization-wide project with many moving parts. Anything that can mitigate risk during this process is a welcome addition.

Continuing our cloud migration example, when it comes time to make the switch, the team in charge of the migration will want to build its experiment: pick success metrics, move a small subset of traffic from the on-prem system to the cloud system, and measure the results. Let’s say that latency is a metric you care about. You tolerate 100 ms of latency on the on-prem system and hope that the cloud system will drop that by 20% to 80 ms. At the conclusion of the experiment, you see that latency has dropped to 90 ms—less than 100 ms, but not as big a drop as you would have liked. The purpose of running an experiment is the same as we talked about before—creating variations and testing a hypothesis—but in this case it has the added benefit of mitigating the risk associated with your migration.
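
Here is a minimal sketch of how the latency comparison in this scenario might be computed from samples collected on each side of the traffic split; the 20% target is the figure from the example above.

```python
from statistics import mean, quantiles

def compare_latency(on_prem_ms: list[float], cloud_ms: list[float],
                    target_improvement: float = 0.20) -> dict:
    """Summarize latency for each cohort and check the hoped-for improvement."""
    baseline, candidate = mean(on_prem_ms), mean(cloud_ms)
    improvement = (baseline - candidate) / baseline
    return {
        "on_prem_mean_ms": round(baseline, 1),
        "cloud_mean_ms": round(candidate, 1),
        "cloud_p95_ms": round(quantiles(cloud_ms, n=20)[18], 1),  # 95th percentile
        "improvement": round(improvement, 3),
        "met_target": improvement >= target_improvement,  # 100 ms -> 90 ms is 0.10, so False
    }
```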

Optimization

The last area of experimentation to cover is optimization. Feature validation is great for testing a small number of options against a control and/or each other, but not for handling larger, more complex sets of variables. For those circumstances we need to use our experimentation framework to find the optimal outcome from that set of variables.

Optimization experiments test which option out of many has the greatest impact on a defined metric. They happen at any stage, with product managers, developers, or designers in control. We’re still setting up an experiment as we have before—defining success metrics, testing variations based on a hypothesis, and measuring the result—but we’re doing so in hopes of choosing the best option available, not just deciding between one or the other.

One difference with optimization experiments is in the metric definition. Success metrics in these experiments may be the same for all users, or they may differ from user to user based on personalized characteristics. For example, what is the best way to optimize a conversion flow? The answer could be very different according to the makeup of your user base.

Up to this point, we’ve viewed experiments as binary. A or B. New or old. The A/B/n experiments we discussed earlier work like this and have tremendous value. But these experiments work best when n represents a small number of options—say, two or three variations. With too many options, these types of tests can become unwieldy and produce data that is difficult to interpret.

When the number of variations grows beyond what a standard A/B/n test can handle, one solution is to leverage parameters within optimization experiments. These experiments can sometimes be run with machine learning solutions to find the optimal outcome from a wide variety of options. Algorithms such as the contextual multiarmed bandit, which tests multiple variations and sends more traffic to the better-performing ones as time goes on, will try different combinations of the provided variable set to learn which combinations work best in different circumstances.
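
As a simplified illustration, here is a sketch of a (non-contextual) epsilon-greedy bandit; a contextual bandit would additionally condition each choice on user attributes such as geography or device type.

```python
import random

class EpsilonGreedyBandit:
    """Mostly serve the best-performing variation, occasionally explore the others."""

    def __init__(self, variations, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.trials = {v: 0 for v in variations}
        self.successes = {v: 0 for v in variations}

    def choose(self) -> str:
        untried = [v for v, n in self.trials.items() if n == 0]
        if untried:
            return random.choice(untried)               # try every arm at least once
        if random.random() < self.epsilon:
            return random.choice(list(self.trials))     # explore
        return max(self.trials,                         # exploit the best observed rate
                   key=lambda v: self.successes[v] / self.trials[v])

    def record(self, variation: str, converted: bool) -> None:
        self.trials[variation] += 1
        self.successes[variation] += int(converted)
```

Each incoming user gets a variation from choose(), and each conversion (or lack of one) is fed back through record(), so traffic drifts toward the winning variation over time.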

Setting broader parameters allows machines or humans (developers or others) to change the way an experiment is being run without altering code. Product managers, marketers, operations teams, and others can test small changes in the variations that affect the users without needing developers to push new code configurations. This frees up developers to focus on their next sprint and lets the people who are closest to the users and know them best make the decisions.

Experimentation Examples

We’ve talked a lot about experimentation in the abstract sense, but what does this actually look like in the real world? Let’s walk through a handful of examples that illustrate the power of adopting the experimentation framework that we laid out in this chapter.

Measuring a Sign-Up Flow

For our first example, let’s look at a typical sign-up flow. A sign-up flow has two goals: to have the highest possible number of users complete the flow, and to extract the most information from the users. Unfortunately, these two goals can sometimes conflict with each other. Having fewer questions decreases the friction in signing up and results in more sign-ups, but adding questions to the sign-up flow means gaining a greater understanding of each user. Both are important goals, but they can negatively impact each other and require a certain balance.

A lot of variables can impact these goals: the number of questions asked of the user, the placement of the sign-up box, the color of the sign-up box, the color of the sign-up font, the inclusion of an exclamation point (Sign up!), the order of the questions, which questions are mandatory, and so forth.

With this many variables, and with so many unknowns about who visits the sign-up page, how can anyone reliably know which questions to include and which ones to make mandatory? Will the answers change based on the user’s geography, referral source, or history on the site? With many variables come many possible answers—too many for A/B/n tests to measure.

Enter optimization experiments. Creating an optimization experiment that leverages a machine learning solution (open source tools like Ax, or hosted SaaS services) lets organizations use models that will only become more robust as the technology develops. Organizations can set parameters that let machine learning models try out all the options to see what works best. In this example, you can define parameters for all these options and let the model try them in all their different combinations. The model will discard combinations that don’t perform well and pit the ones that do against one another. This process is repeated at scale until, at the end, you have an optimized set of results.
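
To make the scale of the problem concrete, the sketch below enumerates a hypothetical parameter space for the sign-up flow; an optimizer (a bandit, an open source tool like Ax, or a hosted service) would search this space instead of a team hand-building a separate test for every combination.

```python
from itertools import product

# Hypothetical sign-up flow parameters an optimizer could search over.
parameter_space = {
    "question_count": [3, 5, 8],
    "signup_box_color": ["blue", "green", "purple"],
    "button_text": ["Sign up", "Sign up!", "Get started"],
    "email_only_mandatory": [True, False],
}

combinations = [dict(zip(parameter_space, values))
                for values in product(*parameter_space.values())]
print(len(combinations))  # 54 combinations: already far too many for a plain A/B/n test
```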

The parameters need to be set up by engineers who understand how to manipulate the experimentation platform. Once the setup is complete, the reins can be handed over to others to run. In the sign-up flow example, product managers or marketers might be best suited to run the optimization experiment.

Fifty Shades of Blue

Another relevant example of experimentation from Google is its “50 shades of blue” experiment. When Google launched Gmail, it had blue ads on the right sidebar. But there are many shades of blue, something you’ve likely encountered if you’ve ever gone to the paint section of a home improvement store. Which blue would entice more users to click? Anyone could have guessed, but no one knew.

So Google tested 40+ shades of blue, trying out each variation with 1% of its users. The data showed that a shade of blue with a slightly purple tint outperformed the other options. The implementation of this specific blue into the ads resulted in $200 million in additional ad revenue.

Not every experiment will lead to $200 million in measurable impact, since not every product operates at the scale of Gmail. Still, every team has goals to hit and metrics to surpass, and optimization experiments can surface better options than opinions, hunches, or guesswork would. Improving the sign-up form conversion rate by 2% has a measurable impact as well, and continually stacking 2% improvements on top of one another adds up: ten such gains compound to a lift of more than 20%.

Machine learning was not used widely outside of academia when Gmail launched, so the world of testable combinations has greatly expanded since Google ran its experiment. If the experiment were to take place today, the color of the ads could be combined with the font, size, and character length to further optimize for the highest click-through rate.

Summary

Let’s recap what we’ve covered in this chapter. Good measurement practices are the first step toward understanding and the foundation for experimentation. Creating a culture of experimentation for product validation, risk mitigation, and optimization allows organizations to make decisions based on data rather than on opinions. Last, modern experimentation platforms allow cohesive, heterogeneous teams to test hypotheses and make better decisions across the entire operational process.

We’ve illustrated where measuring and monitoring diverge in purpose. Whereas teams used to rely on SREs to monitor systems in case something bad happened, developers and product teams can now run experiments that inform their next steps before something goes wrong. And while monitoring was confined to the final stage of the ops process, experiments can happen concurrently at all stages of the SDLC. The measure and experiment stage is a critical step in modern software delivery pipelines.

Measurement and experimentation enable teams across an organization to take proactive steps to test new capabilities and allow them to create a framework for how new features and products can simultaneously be rolled out while also mitigating risk. The takeaway here should be this: create a strong system for measurement to track and collect data, use those measurements to determine success metrics, create a hypothesis based on those success metrics, and run experiments to prove or disprove the hypothesis.
