Chapter 1. Resilience in Software and Systems

The world stands on absurdities, and without them perhaps nothing at all would happen.

Fyodor Dostoevsky, The Brothers Karamazov

In our present reality, cybersecurity is more of an arcane art than science—an inscrutable, often excruciating, sequence of performative rituals to check boxes1 that affirm you’ve met the appropriate (and often arbitrary) regulatory or standards requirement. In terms of systems security, the ideal state is one of resilience—ensuring that systems can operate successfully now and in the future despite the dangers lurking in our digital world. To sustain resilience, we must understand how all the system’s machines and humans interact in pursuit of a common goal and how they respond to disruption.2 Knowing the intricacies of cybersecurity isn’t enough. We must understand the system’s resilience (or lack thereof) if we hope to protect it, which involves understanding the system’s dynamics, as we’ll explore in this chapter. This is why, throughout this book, we treat security as a subset of resilience.

This book is a practical guide for how to design, build, and operate systems that are more resilient to attack. We will prioritize progress over perfection. We will draw on lessons from other complex systems domains, like healthcare, aerospace, disaster recovery, ecology, urban infrastructure, and psychology; indeed, the rich discourse on resilience in these other domains is another reason why resilience is the foundation of Security Chaos Engineering (SCE). “Security” is an abstract, squishy concept that is largely self-contained within the cybersecurity (or “infosec”) industry, with occasional appropriations of concepts from physical safety, law enforcement, and, rather infamously, warfare. With resilience, however, there’s much we can learn from other disciplines to help us in our quest to operate our systems safely, reducing the amount of work and “thinky thinky” required for us to succeed.

As we shift our focus from security to resilience, we gain a superpower: we invest our time, energy, and other resources in outcome-driven activities rather than wasting those resources on performative work that may feel productive but does not, in reality, protect our systems. Resilience matters in any complex system and is especially illuminating in complex systems involving humans, inspiring us to change what we can to prepare for success in an uncertain future. This is our vision for security programs, which may become resilience programs going forward. If you join the SCE movement, you can protect your organization’s ability to thrive now and in the future. All that’s required for you to join is an openness to new perspectives and new ways of achieving security outcomes, which is the focus of this book.

SCE seeks to uncover, through experimental evidence, whether our systems are resilient to certain conditions so we can learn how to make them even more resilient. Failure is a normal condition of system operation. SCE offers organizations a pragmatic set of principles and practices for proactively uncovering unknown failure within their systems before it manifests into customer-facing and business-impacting problems. Those practicing SCE eschew waiting to passively witness how things break in production, morphing from a reactive stance to a proactive one.

These are just some of the transformative outcomes possible by adopting a resilience-based approach to systems security through SCE. Before we embark on our quest, we must—like any noble scientist—absorb the foundational concepts that will pave our journey. What is a complex system? What is failure? What is resilience? And how does SCE fit in? This chapter answers all of those questions. A lot of traditional infosec folk wisdom will be challenged and discarded in favor of hard-earned, empirical lessons from resilience across domains. We invite you, like Neo in The Matrix, to free your mind and take a leap with us into the real world where resilience is not just how we survive, but how we thrive.

What Is a Complex System?

A complex system is one in which a bunch of components interact with each other and generate nonlinear behavior as a result. Humans deal with complex systems constantly, from global supply chains to national economies to cities to our own bodies and our own brains. But before we explore complex systems in more depth, we should probably understand what complexity is.

Complexity is formally defined as a summary term for the properties that emerge from interrelations between variables in a system.3 As system capabilities increase, interdependencies between these variables extend, deepen, and become more obscure. These interdependencies can lead to a domino effect where disturbances diffuse from one variable to others connected to it. This is referred to as a “contagion effect” and is seen across domains like financial markets,4 psychology,5 and, of course, biological viruses.6 In distributed systems, you’re likely to hear this same phenomenon referred to as “cascading failure” (which we will explore in more depth in Chapter 3).
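To make the contagion idea concrete, here is a minimal toy sketch in Python; the service names, dependency graph, and spread probability are invented for illustration and do not model any real system. It shows how a disturbance in one component can diffuse through interdependencies to components that never touched the original failure:

import random

# Toy dependency graph: each service lists the services it depends on.
# All names and probabilities here are hypothetical.
DEPENDS_ON = {
    "web": ["auth", "catalog"],
    "auth": ["db"],
    "catalog": ["db", "cache"],
    "db": [],
    "cache": [],
}

def cascade(initial_failure, spread_probability=0.7, seed=None):
    """Simulate a contagion effect: once a dependency fails, each service
    depending on it may fail too, which in turn may drag down others."""
    rng = random.Random(seed)
    failed = {initial_failure}
    changed = True
    while changed:
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service in failed:
                continue
            # A healthy service is at risk as soon as any of its dependencies has failed.
            if any(dep in failed for dep in deps) and rng.random() < spread_probability:
                failed.add(service)
                changed = True
    return failed

if __name__ == "__main__":
    # A single database failure can take out services that never touch the db directly.
    print(cascade("db", seed=42))

Even in this tiny graph, whether the failure stays contained or cascades depends on the interconnections between components, not on any single component in isolation.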

What makes a system complex in practice? Well, let’s think about simple systems first. Linear systems are easy. If something bad happens, you just make sure the system is robust enough to bounce back to its status quo. Consider Mr. Potato Head; there is a clear, direct cause and effect between you plugging in his eyeballs and the eyeballs residing on his face. If you plug his ear into his arm socket instead, it is easy to revert to the intended layout—and this mistake does not cause problems with his eyes, feet, hat, mouth, or iconic mustache. The potato “head” base interacts with the appendage components in an extremely predictable, repeatable fashion. But linear systems like Mr. Potato Head7 are more likely found in textbooks and contrived hypotheticals rather than the real world, which is messy. To adapt the classic wisdom for computer systems, “No ‘perfect’ software service survives first contact with production traffic.”

Where do complex systems and linear systems most diverge? To answer that question, let’s explore the nature of complex systems: their variety, their adaptability, and their holistic nature.

Variety Defines Complex Systems

Variety is perhaps the defining element of complexity; what we tend to describe as “complex” systems are those with a great degree of variety. Systems are replete with all kinds of variety: the variety of components, the variety of interactions between those components, and the variety of potential outcomes in the system. Our computer systems involve a lot of variables and components—wetware (our brains), software, hardware—and thus offer a large variety of potential future states too.

Because complex systems involve a veritable festival of variables cavorting and gallivanting together in often unconventional ways, they can present a large variety of possible states.8 Safety research organization Jepsen notes that distributed systems “have a logical state which changes over time” and all types of complex computer systems are no different. This is why prediction is perceived as an extravagant distraction in a resilience paradigm. With so many possible future states within the system, there is never just one path that will lead to a particular outcome.9 Even trickier, performing the same sequence of actions in the system may result in different outcomes. Getting from point A to point B in a complex system is less “as the crow flies” and more “as the cat wanders (or zoomies).”

Complex Systems Are Adaptive

Importantly, complex systems are adaptive; they change and evolve over time and space, especially in response to changes in their external environment. Our software systems are complex adaptive systems and, as some computer people sometimes forget, they are sociotechnological in nature. Both machines and humans qualify as components of our tech systems, whether production systems or corporate IT systems. The cybersecurity ecosystem is a complex adaptive system in itself, consisting of not only a huge variety of machines, but also developers, defenders, end users, government regulators, government-sponsored attackers, criminal organizations, auditors, and countless other human stakeholders.

Complex systems are adaptive because change cannot be stopped—and therefore failure cannot be stopped from ever occurring. A resilient complex system is one that can handle this failure gracefully, adapting its behavior to preserve its critical functions.10 Understanding the transitions between the variety of potential future states—how a system adapts to ongoing changes—is key to understanding your system’s resilience and security over time.

We also must appreciate how humans are a source of strength when a system nears the boundary between success and failure; humans are, in many cases, our mechanism for adaptation in the face of adverse and evolving conditions due to their natural tendency to be curious about the problems they face.11 In healthcare, for example, emergency departments are often brittle due to management-driven changes imposed by financial and operational pressures, making them less resilient in the face of “accumulating or cascading demands” that push them beyond their known-safe capacity.12 The emergency department system therefore must “stretch” in the face of greater demands on its operation so individual failures—like necessary activities falling through the cracks—do not accumulate and tip the overall system into failure. How does this stretching happen? The humans in the system are the ones who enable this stretching. Humans can work harder, adjust their strategies, and scour for extra resources, like asking other humans to come help out to provide additional adaptive capacity.13

The Holistic Nature of Complex Systems

The complexity of a system—and its behavior—is defined holistically; little can be gleaned from the behavior of individual constituent components.14 Systems thinking is not natural to most humans and security professionals aren’t exempt. The cybersecurity industry has trended toward componentization, even down to specific tooling and environments. Unfortunately, this component-based focus restricts your ability to understand how your slice of security relates to all the other components, which together shape how the system operates. Only looking at one component is like a narrow viewing frustum—or, if you prefer, like a cone collar that makes it difficult to navigate nimbly about the world. Nancy G. Leveson, professor of aeronautics and astronautics at MIT, cautions us that “a narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents.”15

Attackers perceive the holistic nature of systems, routinely taking advantage of interrelations between components. They look for how interactions within a system can give them an advantage. All the attack phase “lateral movement” means is leveraging the connections between one component in a system and the other components around it. Defenders, however, traditionally conceive of security in terms of whether individual components are secure or whether the connections between them are secure. Status quo security thinks in terms of lists rather than graphs, whereas attackers think in graphs rather than lists. There is a dearth of systems knowledge in traditional cybersecurity defense, which attackers are all too happy to exploit.

The high-profile SolarWinds compromise highlights this dynamic well. The Russian Foreign Intelligence Service (SVR) had a particular objective in mind: gaining access to federal agencies and Fortune 500 companies alike for (presumably) espionage purposes. One way they could achieve this outcome was by looking for any components and subsystems these target systems had in common. SolarWinds’ Orion platform, offering infrastructure monitoring and management, was a tool that was not only used by many federal agencies and large enterprises, but also possessed functionality that granted it access to customers’ internal infrastructure. Through this lens, attackers exploited interrelations and interactions at multiple levels of abstraction.

A holistic systems perspective must not be limited to specific technology, however. The way in which organizations use technology also involves economic and social factors, which are all too frequently ignored or overlooked by traditional enterprise cybersecurity programs. Economic factors in an organization include revenue and profit goals, how compensation schemes are structured, or other budgetary decisions. Social factors in an organization include its key performance indicators, the performance expectations of employees, what sort of behavior is rewarded or reprimanded, or other cultural facets.

As an industry, we tend to think of vulnerabilities as things born of flaws in software, but vulnerabilities are born of incentive paradigms too. We overlook the vulnerability in incentivizing employees to do more work, but faster. We overlook the vulnerability in giving bonuses to “yes” people and those who master political games. These vulnerabilities reduce the organization’s resilience to failure just the same as software flaws, but they rarely appear in our threat models. Occasionally, we will identify “disgruntled employees” as a potential insider threat, without exploring the factors that lead them to be disgruntled.

These “layer 8” (human) factors may be difficult to distinguish, let alone influence, in an organization. But when we consider how failure manifests, we must nevertheless take them into account.

What Is Failure?

Failure refers to when systems—including any people and business processes involved—do not operate as intended.16 A service that does not complete communication with another service on which it depends would count as a failure. Similarly, we can consider it a failure when security programs do not achieve security objectives. There are more possible failures in software than we could hope to enumerate: abuse of an API, which results in a leak of user data that provokes anxiety and requires users to purchase credit monitoring services; a denial-of-service (DoS) attack, which lasts long enough to violate service-level agreements (SLAs) and trigger revenue refunds; repetitive, tedious, and manual tasks with unclear outcomes, which result in burned out, resentful human operators. And so on, into infinity.

Failure is inevitable and happening all the time. It is a normal part of living and operating in a complex system, and our decisions—successful or not—influence the outcomes. Regardless of the domain of human activity, avoiding failure entirely is possible only by avoiding the activity entirely: we could avoid plane crashes by never flying planes; avoid deaths during surgery by never undergoing surgery; avoid financial crashes by never creating a financial system; or, in the realm of cybersecurity, avoid software failure by never deploying software. This sounds silly, and it is, but when we aim for “perfect security,” or when the board of directors demands that we never experience a security incident, we are setting ourselves up (ironically) for failure. Since the status quo goal of security programs is to “prevent incidents,” it’s no wonder practitioners feel engaged in a battle they’re constantly losing.

Despite its common characterization in cybersecurity, security failure is never the result of one factor. A failure is never solely because of one vulnerability’s existence or the dismissal of a single alert. Failure works like a symphony, with multiple factors interacting together in changing harmonies and discords. As such, we must adopt a systems perspective when seeking to understand security failure, expanding our focus to look at relationships between components rather than pinning the blame to a singular cause; this systems perspective will be the focus of Chapter 2.

When we think about security failure, we also tend to think about situations that occur once systems are deployed and running in production—like data breaches. But the conditions of security failure are sown much earlier. Failure is a result of interrelated components behaving in unexpected ways, which can—and almost always do—start much further back, in how systems are designed and developed and in other activities that inform how our systems ultimately look and operate.

Failure is a learning opportunity. It is a chance to untangle all the factors that led to an unwanted event to understand how their interaction fomented failure conditions. If you do not understand the goals, constraints, and incentives influencing a system, you will struggle to progress in making the system more resilient to attack.

Acute and Chronic Stressors in Complex Systems

When we think about security incidents, we may tend to blame the incident on an acute event—like a user double-clicking on a malicious executable or an attacker executing a crypto miner payload. In resilience lingo, these acute events are known as pulse-type stressors. Pulse-type stressors are negative inputs to the system that occur over a short duration, like hurricanes in the context of ecological systems. Press-type stressors, in contrast, are negative inputs that occur over longer periods of time; in ecological systems, this can include pollution, overfishing, or ocean warming. For clarity, we’ll call pulse-type stressors “acute stressors” and press-type stressors “chronic stressors” throughout the book.

The problem with a myopic focus on acute stressors in any complex system is that those events will not tip the system over into failure modes on their own. The background noise of chronic stressors wears down the resilience of the system over a longer period of time, whether months or years, so when some sort of acute event does occur, the system is no longer able to absorb it or recover from it.

What do chronic stressors look like in cybersecurity? They can include:

  • Regular employee turnover

  • Tool sprawl and shelfware

  • Inability to update systems/software

  • Inflexible procedures

  • Upgrade-and-patch treadmill

  • Technology debt

  • Status quo bias

  • Employee burnout

  • Documentation gaps

  • Low-quality alert signals

  • Continuous tool maintenance

  • Strained or soured relationships across the organization

  • Automation backlog or manual procedure paradigm

  • Human-error focus and blame game

  • Prevention mindset

And acute stressors in cybersecurity can include:

  • Ransomware operation

  • Human mistake

  • Kernel exploit

  • New vulnerability

  • Mergers, acquisitions, or an IPO event

  • Annual audit

  • End-of-life of critical tool

  • Outage due to security tool

  • Log or monitoring outage

  • Change in compliance standard

  • New product launch

  • Stolen cloud admin credentials

  • Reorg or personnel issues

  • Contractual changes (like SLAs)

While recovery from acute stressors is important, understanding and handling chronic stressors in your systems will ensure that recovery isn’t constrained.

Surprises in Complex Systems

Complex systems also come with the “gift” of sudden and unforeseen events, referred to as “surprises” in some domains. Evolutionary biologist Lynn Margulis eloquently described the element of surprise as “the revelation that a given phenomenon of the environment was, until this moment, misinterpreted.”17 In the software domain, the surprises we encounter come from computers and humans. Both can surprise us in different ways, but we tend to be less forgiving when it’s humans who surprise us. It’s worth exploring both types of surprises because accepting them is key to maintaining resilience in our complex software systems.

Computer surprises

Computers surprise us in a vexing variety of ways. Kernels panic, hard disks crash, concurrency distorts into deadlocks, memory glitches, and networks disconnect (or worse, network links flap!). Humanity is constantly inundated with computer surprises.18 An eternal challenge for programming is how to ensure that a program behaves exactly as its designer intended.

When we encounter computer surprises, our instinct (other than cursing the computer) is to eliminate the potential for the same surprise in the future. For hardware, we might enable Trusted Platform Modules (TPMs) in an attempt to better secure cryptographic keys, or we might add redundancy by deploying additional physical replicas of some components. For software, we might add a vulnerability scanner into our development pipeline or institute a requirement for manual security review before a push to production. If you can remove all bugs from software before it reaches production environments, then you can minimize computer surprises, right?

Of course, none of these responses—nor any response—will eradicate the phenomenon of computer surprises. Surprises, like failures, are inevitable. They’re an emergent feature of complex systems.

Yet some computer surprises are rather dubious in the context of security. Attackers using software like NetWire, a public remote administration tool first seen in 2012 and still seen as of the end of 2022, should not be surprising. The reality is that the “fast and ever-evolving attacker” narrative is more mythology than reality (a particularly convenient myth for security vendors and defenders wanting to avoid accountability). Attackers evolve only when they must because their operational strategy, tooling, or infrastructure no longer provides the desired outcomes; we’ll discuss their calculus in Chapter 2.

Tip

Complexity adds vivacity, novelty, and significance to our lives. Without complexity, we wouldn’t have communities, economies, or much progress at all. If we attempt to expunge all complexity from our sociotechnical systems, we may banish adverse surprises at the expense of ostracizing pleasant surprises. The innovation and creativity we cherish are often stifled by “streamlining.”

How can we encourage opportunities to flourish instead—to enliven rather than deaden our systems? We must enhance, rather than impede, complexity19 by preserving possibilities and minimizing the disruption it begets. Meanwhile, we can find the right opportunities to make parts of our systems more linear—more independent components with more predictable, causal influence on outcomes—so that we can expend more effort nourishing plasticity. We’ll navigate how to pursue these opportunities across the software delivery lifecycle in Chapters 3 through 7.

Human surprises

Even though most kindergarteners understand that making mistakes is an inevitable part of life, the traditional cybersecurity industry seems to believe that human mistakes can be eliminated from existence. This magical thinking manifests in status quo security strategies that are extremely brittle to human surprises (which is often correlated with brittleness to computer surprises too). A human clicking on a link in an email, an activity they perform thousands of times in a year without incident, should perhaps be the least surprising event of all. A human procrastinating on patching a server because there are a dozen change approval steps required first is also not surprising. And yet both events are still frequently blamed as the fundamental cause of security incidents, as if the failure is in the human exhibiting very normal human behavior rather than a security strategy that explicitly chooses to be ignorant about normal human behavior.

Systems that maintain the greatest resilience to human surprises combine approaches, recognizing that there is no “one size fits all.” One element is designing (and continuously refining) systems that fit the human, rather than forcing the human to fit the system through stringent policies. In this mindset, user experience is not a nice-to-have, but a fundamental trait of resilient design (which we’ll explore in Chapter 7). Another element is pursuing a learning-led strategy, where mistakes are opportunities to gain new knowledge and inform improvements rather than opportunities to blame, scapegoat, or punish.

We cannot remove the human from our systems anytime soon, and therefore the human element is relevant to all parts of SCE—and to how we nourish resilience in our systems.

What Is Resilience?

Resilience is the ability of a system to adapt its functioning in response to changing conditions so it can continue operating successfully. Instead of attempting to prevent failure, resilience encourages us to handle failure with grace. More formally, the National Academy of Sciences defines resilience as “the ability to prepare and plan for, absorb, recover from, and more successfully adapt to adverse events.”20 In their journal article “Features of Resilience,” the authors outline the five common features that define resilience:21

  1. Critical functionality

  2. Safety boundaries (thresholds)

  3. Interactions across space-time

  4. Feedback loops and learning culture

  5. Flexibility and openness to change

All of these features—which we’ve correlated to the ingredients of our resilience potion below—are relevant to SCE and will recur as a reinforcing motif throughout the book. Let’s discuss each of them in more detail.

Critical Functionality

If you want to understand a system’s resilience, you need to understand its critical functionality—especially how it performs these core operations while under stress from deleterious events. Simply put, you can’t protect a system if you don’t understand its defining purpose.

Defining the system’s raison d'être is an essential prerequisite for articulating its resilience. The goal is to identify the resilience of what, to what, and for whom. Without this framing, our notion of the system’s resilience will be abstract and our strategy aimless, making it difficult to prioritize actions that can help us better sustain resilience.

This foundational statement can be phrased as “the resilience of <critical functionality> against <adverse scenario> so that <positive customer (or organization) outcome>.”

For a stock market API, the statement might read: “The resilience of providing performant stock quotes against DoS attacks for financial services customers who require real-time results.” For a hospital, the statement might read: “The resilience of emergency room workstations against ransomware so healthcare workers can provide adequate care to patients.” Defining what is critical functionality, and what is not, empowers you during a crisis, giving you the option to temporarily sacrifice noncritical functionality to keep critical functionality operational.

This illustrates why the status quo of security teams sitting in an ivory tower silo won’t cut it for the resilience reality. To design, build, and operate systems that are more resilient to attack, you need people who understand the system’s purposes and how it works. Any one person’s definition of a system’s critical functions will be different from another person’s. Including stakeholders with subjective, but experience-informed, perceptions of the system becomes necessary. That is, “defenders” must include a blend of architects, builders, and operators to succeed in identifying a system’s critical functions and understanding the system’s resilience. The team that manages security may even look like a platform engineering team—made up of software engineers—as we’ll explore in depth in Chapter 7.

Safety Boundaries (Thresholds)

Our second resilience potion ingredient is safety boundaries, the thresholds beyond which the system is no longer resilient to stress. We’ll avoid diving too deep into the extensive literature22 on thresholds23 in the context of system resilience.24 For SCE purposes, the key thing to remember about safety boundaries is that any system can absorb changes in conditions only up to a certain point and stay within its current, healthy state of existence (the one that underlies our assumptions of the system’s behavior, what we call “mental models”). Beyond that point, the system moves into a different state of existence in which its structure and behavior diverge from our expectations and intentions.25

A classic example of safety boundaries is found in the book Jurassic Park.26 When the protagonists start their tour of the park, the system is in a stable state. The dinosaurs are within their designated territories and the first on-rail experience successfully returns the guests to the lodge. But changes in conditions are accumulating: the lead geneticist fills in gaps in velociraptors’ DNA sequences with frog DNA, allowing them to change sex and reproduce; landscapers plant West Indian lilacs in the park, whose berries stegosauruses confuse for gizzard stones, poisoning them; a computer program for estimating the dinosaur population is designed to search for the expected count (238) rather than the real count (244) to make it operate faster; a disgruntled employee disables the phone lines and the park’s security infrastructure to steal embryos, causing a power blackout that disables the electrical fences and tour vehicles; the disgruntled employee steals the emergency Jeep (with a rocket launcher in the back), making it infeasible for the game warden to rescue the protagonists now stranded in the park. These accumulated changes push the park-as-system past its threshold, moving it into a new state of existence that can be summarized as chaos of the lethal kind rather than the constructive kind that SCE embraces. Crucially, once that threshold is crossed, it’s nearly impossible for the park to return to its prior state.

Luckily with computer systems, it’s a bit easier to revert to expected behavior than it is to wrangle dinosaurs back into their designated pens. But only a bit. Hysteresis, the dependence of a system’s state on its history,27 is a common feature of complex systems and means there’s rarely the ability to fully return to the prior state. From a resilience perspective, it’s better to avoid passing these thresholds in the first place if we can. Doing so includes continual evolution of the system itself so its safety boundaries can be extended to absorb evolving conditions, which we’ll explore later in our discussion of flexibility and openness to change.

If we want our systems to be resilient to attack, then we need to identify our system’s safety boundaries before they’re exceeded—a continuous process, as safety boundaries change as the system itself changes. Security chaos experiments can help us ascertain our system’s sensitivity to certain conditions and thereby excavate its thresholds, both now and as they may evolve over time. With an understanding of those safety boundaries, we have a better chance of protecting the system from crossing over those thresholds and tipping into failure. As a system moves toward its limits of safe operation, recovery is still possible, but will be slower. Understanding thresholds can help us navigate the tricky goal of optimizing system performance (gotta go fast!) while preserving the ability to recover quickly.
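As a flavor of what excavating a threshold might look like, here is a minimal, hypothetical sketch in Python. The tolerance value and the inject_condition and system_recovered helpers are placeholders you would wire to your own environment and chaos tooling; the point is the shape of the experiment, ramping an adverse condition and watching for the intensity at which recovery degrades, not a prescribed implementation:

import time

RECOVERY_TOLERANCE_SECONDS = 5.0  # hypothetical boundary: recovery slower than this is "unsafe"

def inject_condition(intensity):
    """Placeholder: introduce an adverse condition at the given intensity in a
    test environment (e.g., drop a fraction of responses from an auth service)."""
    raise NotImplementedError  # wire this to your own chaos tooling

def system_recovered():
    """Placeholder: return True once critical functionality passes its health check."""
    raise NotImplementedError

def measure_recovery(intensity, timeout=60.0):
    """Inject an adverse condition and time how long critical functionality takes
    to recover, giving up after the timeout."""
    inject_condition(intensity)
    start = time.monotonic()
    while not system_recovered():
        if time.monotonic() - start > timeout:
            return float("inf")  # never recovered within the observation window
        time.sleep(1)
    return time.monotonic() - start

def find_threshold(intensities):
    """Step through increasing intensities and report the first one at which recovery
    exceeds the tolerance -- a rough estimate of the system's safety boundary."""
    for intensity in intensities:
        recovery = measure_recovery(intensity)
        print(f"intensity={intensity}: recovered in {recovery:.1f}s")
        if recovery > RECOVERY_TOLERANCE_SECONDS:
            return intensity
    return None  # boundary not reached within the tested range

Running an experiment like this on a recurring basis gives you a trail of evidence about where the boundary sits today and how it shifts as the system changes.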

Tip

By conducting chaos experiments regularly—or, even better, continuously—experimental outcomes should reveal whether your systems (from a sociotechnical perspective) seem to be drifting toward thresholds that might make them less resilient in the face of a sudden impact. This drift could indicate the presence of chronic stressors, which are worth taking the time to dig into and uncover so you can nurse the systems (and the team) back to a healthier operational baseline. (We’ll discuss chaos experimentation for assessing resilience in Chapter 2.)

Finally, remember that complex systems are nonlinear. What may seem like a minor change can be the proverbial final straw that pushes the system past its resilience threshold. As eloquently stated by ecological resilience scholars, “relatively small linear changes in stressors can cause relatively abrupt and nonlinear changes in ecosystems.”28 It is never just one factor that causes failure, but an accumulation that breaches the boundaries of the system’s safe operation.

Interactions Across Space-Time

Because complex systems involve many components interacting with each other, their resilience can only be understood through system dynamics across space and time. As variables (not the programming kind) within your system interact, different behaviors will unfold over time and across the topology of the system. The temporal facets of resilience include the timing of an incident as well as the duration between the incident occurring and recovery of system functionality. The spatial facet is the extent of the incident’s impact—the resulting state of the system across its components, functionality, and capability.

For instance, when considering the resilience of a consumer-facing application against a distributed denial-of-service (DDoS) attack, one or some or all services might be affected. The attack can happen during peak traffic hours, when your servers are already overwhelmed, or during sleep time for your target customers, when your servers are yawning with spare capacity. Recovery to acceptable performance standards might be milliseconds or hours. A long outage in one service can degrade performance in other services to which it’s connected. And an outage across all services can lead to a longer recovery time for the system as a whole. As this example shows, time and space are inherently entwined, and you must consider both when assessing resilience.

There’s one caveat to the notion of “time” here (other than the fact that humanity still doesn’t understand what it really is).29 Time-to-recovery is an important metric (which we’ll discuss in Chapter 5), but as we learned in our discussion of safety boundaries, it’s equally important to consider that there might be preferable alternative states of operation. That is, continually recovering to the current equilibrium is not always desirable if that equilibrium depends on operating conditions that don’t match reality. As in our example of dinosaurs gone wild in Jurassic Park, sometimes you simply can’t return to the original equilibrium because reality has changed too much.

Feedback Loops and Learning Culture

Resilience depends on remembering failure and learning from it. We want to handle failure with ease and dignity rather than just prevent it from occurring; to do so, we must embrace failure as a teacher. Feedback loops, in which outputs from the system are used as inputs in future operations, are therefore essential for system resilience. When we observe an incident and remember the system’s response to it, we can use it to inform changes that will make the system more resilient to those incidents in the future. This process of learning and changing in response to events is known as adaptation, which we introduced earlier in the chapter and will cover in more detail shortly. Unfortunately, we often hear the infosec folk wisdom of how quickly attackers adapt to defenses, but there isn’t as much discussion about how to support more adaptation in our defenses (outside of hand-wavy ballyhoo about AI). SCE aims to change that.

The importance of operational memory for resilience is seen across other domains too. For instance, ecological memory, the ability of the past to influence an ecosystem’s present or future responses,30 involves both informational and material memory. Informational memory includes adaptive responses while material memory includes new structures, like seeds or nutrients, that result from a disturbance.31

Maintaining this memory is nontrivial, but diversity within a system helps. Socioecological systems that adapt through modular experimentation can more effectively reorganize after a disturbance.32 When a system’s network of variables is more diverse and the connections between them more modular, failure is less likely to reach all functions within a system. This ability to reorient in the face of failure means that fewer variables and connections between them are lost during a disturbance—preserving more memories of the event that can be used as inputs into feedback loops.

As an example, the urbanization of Constantinople was successful in part because the community learned from repeated sieges upon the city (on average every half-century).33 These repeated incidents, write archaeology and urban sustainability scholars, “generated a diversity of socioecological memories—the means by which the knowledge, experience, and practice of how to manage a local ecosystem were stored and transmitted in a community.” It wasn’t just historians preserving these memories. Multiple societal groups maintained memories of the siege, leading to adaptations like decentralization of food production and transportation into smaller communities that were less likely to be disrupted by a siege—conceptually akin to a modular, service-oriented architecture that isolates disruption due to a resource spike in one component. Defensive walls were actually moved to make space for this new agricultural use as well as for gardens. These walls served as visual reminders of the lessons learned from the siege and protected those memories from dissolving over time.

As we will continue to stress throughout the book, our computer systems are sociotechnical in nature and therefore memory maintenance by communities is just as essential for systems resilience in our world as it was for Constantinople. No matter who is implementing the security program in your organization, they must ensure that insights learned from incidents are not only stored (and in a way that is accessible to the community), but also leveraged in new ways as conditions change over time. We’re all too familiar with the phenomenon of a company being breached by attackers and suddenly investing a mountain of money and a flurry of energy into security. But, as memory of the incident fades, this enhanced security investment fades too. Reality doesn’t take a break from changing, but going a while without a shocking event can dull the perceived need to prepare for the next one.

A learning culture doesn’t just happen after an incident. Feedback loops don’t happen once, either. Monitoring for changes in critical functions and system conditions is complementary to preserving incident memories. Incidents—whether caused by attackers or security chaos experiments—can provide a form of sensory input that helps us understand causal relationships within our systems (similar to touching a hot stove and remembering that it leads to “ouch”). Monitoring, logging, and other forms of sensing help us understand, based on those causal relationships, whether our systems are nearing the boundaries of safe operation. While past behavior isn’t an indicator of future behavior, the combination of learning from memories, collecting data on system behavior, and conducting experiments that simulate failure scenarios can give us far more confidence that when the inevitable happens, we’ll be prepared for it.

Flexibility and Openness to Change

Finally, maintaining flexibility across space-time is an essential part of resilience. This is often referred to as “adaptive capacity” in resilience engineering.34 Adaptive capacity reflects how ready or poised a system is for change—its behaviors, models, plans, procedures, processes, practices—so it can continue to operate in a changing world featuring stressors, surprises, and other vagaries.35 We must sustain this flexibility over time too, which can get tricky in the face of organizational social dynamics and trade-offs.36

As we mentioned with learning cultures and feedback loops (and will discuss more in Chapter 4), modularity can support resilience. Keeping the system flexible enough to adapt in response to changes in operating conditions is one element of this. The system must be able to stretch or extend beyond its safety boundaries over space and time. (Hopefully you’re starting to see how all the ingredients in our resilience potion complement each other!) Another way to think about this, as David D. Woods pithily puts it, is that we must “be prepared to be surprised.”

As always, the human element is poignant here too. We, as stakeholders who wish to keep our systems safe, also need to be open to change within ourselves (not just in our machines). We might be wedded to the status quo or we may not want to change course because we’ve already invested so much time and money. Maybe something was our special idea that got us a promotion. Or maybe we’re worried change might be hard. But this cognitive resistance will erode your system’s ability to respond and adapt to incidents. A good decision a year ago might not be a good decision today; we need to be vigilant for when our assumptions no longer ring true based on how the world around us has changed.

There’s never just one thing that will affect your systems; many things will continue to surprise or stress the system, constantly shifting its safety boundaries. Flexibility is the essential property that allows a system to absorb and adapt to those events while still maintaining its core purpose. We’ll never cultivate complete knowledge about a system, no matter how much data we collect. Even if we could, reducing uncertainty to zero may help us understand the system’s state right now, but there would still be plenty of ambiguity in how a particular change will impact the system or which type of policy or procedure would enhance the system’s resilience to a particular type of attack. There are simply too many factors and interconnections at play.

Because we live in this indeterministic reality, we must preserve an openness to evolution and discard the rigidness that status quo security approaches recommend. Building upon the feedback loops and learning culture we discussed, we must continue learning about our systems and tracking results of our experiments or outcomes of incidents to refine our understanding of what a truly resilient security program looks like for our systems. This continual adaptation helps us better prepare for future disruptions and gives us the confidence that no matter what lies ahead, it is within our power to adjust our response strategies and ensure the system continues to successfully fulfill its critical function.

Resilience Is a Verb

“Resilient” or “safe” or “secure” is an emergent property of systems, not a static one.39 As a subset of resilience, security is something a system does, not something a system has. As such, security can only be revealed in the event stream of reality. Resilience not only represents the ability to recover from threats and stressors, but also the ability to perform as needed under a variety of conditions and respond appropriately to both disturbances as well as opportunities. Resilience is not solely about weathering tempests. It’s also about innovation—spotting opportunities to evolve your practices to be even better prepared for the next storm. Resilience should be thought of as a proactive and perpetual cycle of system-wide monitoring, anticipating disruptions, learning from success and failure, and adapting the system over time.40 While “resiliencing” would be the most appropriate term to use to capture the action-verb nature of resilience, we will use turns of phrase like “sustain resilience” or “maintain resilience” throughout the book for clarity.

Similarly, security is a value we should continually strive to uphold in our organizations rather than treat as a commodity, destination, or expected end state. We must think in terms of helping our systems exist resiliently and securely in the capricious wilds of production, rather than “adding security” to systems. This perspective helps us understand how failure unfolds across a system, which allows us to identify the points at which this failure might have been stopped or diverted. This ultimately helps inform which signals can help us identify failure earlier, continuing the cycle.

Viewing security as something a system does rather than has also positions you to anticipate conditions for failure that might emerge in the future. Human factors and systems safety researchers Richard Cook and Jens Rasmussen note in their journal article “Going Solid” that as a system continues to operate over time, it has “a tendency to move incrementally towards the boundaries of safe operations”41—those thresholds we discussed earlier. Productivity boosts due to technology rarely manifest in shorter work hours or a newfound capacity to improve security. Organizations will always want to perform their activities with less expense and greater speed, and this desire will manifest in its systems.

Continually evolving and adapting systems provides emergent job security for those responsible for resilience and security. If the security program can help the organization anticipate new types of hazards and opportunities for failure as the organization evolves its systems, then security becomes invaluable. This is often what cybersecurity writ large thinks it does, but in reality it tends to anchor the organization to the past rather than look ahead and carve a safe path forward. What other misperceptions slink within the common cybersecurity conversation? Next, we’ll explore other myths related to resilience.

Resilience: Myth Versus Reality

Aristotle argued that to understand anything, we must understand what it is not.42 This section will cover the myths and realities of resilience with the goal of helping you be more resilient to snake oil around the term.

Myth: Robustness = Resilience

A prevalent myth is that resilience is the ability of a system to withstand a shock, like an attack, and revert to “normal.” This ability is specifically known as robustness in resilience literature. We commonly see resilience reduced to robustness in the cybersecurity dialogue, especially in overzealous marketing (though cybersecurity is not the only domain felled by this mistake). The reality is that a resilient system isn’t merely a robust one; it’s a system that can anticipate potential situations, observe ongoing situations, respond when a disturbance transpires, and learn from past experiences.43

A focus on robustness leads to a “defensive” posture. Like the giant oak of Aesop’s fable, robustness tries to fight against the storm while the reeds—much like adaptive systems—humbly bend in the wind, designed with the assumption that adverse conditions will impact them. As a result, the status quo in cybersecurity aims for perfect prevention, defying reality by attempting to keep incidents from happening in the first place. This focus on preventing the inevitable distracts us from preparing for it.

Robustness also leads us to prioritize restoring a compromised system back to its prior version, despite it being vulnerable to the conditions that fostered compromise.44 This delusion drives us toward technical or punitive controls rather than systemic or design mitigations, which in turn creates a false sense of security in a system that is still inherently vulnerable.45 Extracting a lesson from the frenetic frontier of natural disasters, if a physical barrier to flooding is added to a residential area, more housing development is likely to occur there—resulting in a higher probability of catastrophic outcomes if the barrier fails.46 To draw a parallel in cybersecurity, consider brittle internal applications left to languish with insecure design due to belief that a firewall or intrusion detection system (IDS) will block attackers from accessing and exploiting it.

The resilience approach realizes that change is the language of the universe. Without learning from experience, we can’t adapt to reality. Resilience also recognizes that thriving is surviving. Robustness, the ability to withstand an adverse event like an attack, is not enough.

Myth: We Can and Should Prevent Failure

The traditional cybersecurity industry is focused on preventing failure from happening. The goal is to “thwart” threats, “stop” exploits, “block” attacks, and other aggressive verbiage to describe what ultimately is prevention of attackers performing any activity in your organization’s technological ecosystem. While illuminating incident reviews (sometimes called “postmortems”) on performance-related incidents are often made public, security incidents are treated as shameful affairs that should only be discussed behind closed doors or among fee-gated information-sharing groups. Failure is framed in moral terms of “the bad guys winning”—so it’s no wonder the infosec industry discourse often feels so doom-and-gloom. The goal is to somehow prevent all problems all the time, which, ironically, sets the security program up for failure (not to mention smothers a learning culture).

Related to this obsession with prevention is the more recent passion for prediction. The FAIR methodology, a quantitative risk analysis model designed for cybersecurity, is common in traditional security programs and requires assumptions about likelihood of “loss event frequency” as well as “loss magnitude.” The thinking goes that if you can predict the types of attacks you’ll experience, how often they’ll occur, and what the impact will be, you can determine the right amount to spend on security stuff and how it should be spent.

But accurate forecasting is impossible in complex systems; if you think security forecasting is hard, talk to a meteorologist. Our predilection for prediction may have helped Homo sapiens survive by solving linear problems, like how prey will navigate a forest, but it arguably hurts more than it helps in a modern world replete with a dizzying degree of interactivity. Since resilience is something a system does rather than has, it can’t be quantified by probabilities of disruptive events. As seismologist Susan Elizabeth Hough quipped in the context of natural disasters, “A building doesn’t care if an earthquake or shaking was predicted or not; it will withstand the shaking, or it won’t.”47

Attempting to prevent or predict failure in our complex computer systems is a costly activity because it distracts us and takes away resources from actually preparing for failure. Treating failure as a learning opportunity and preparing for it are more productive and pragmatic endeavors. Instead of spending so much time predicting what amount of resources to spend and where, you can conduct experiments and learn from experience to continually refine your security program based on tangible evidence. (And this approach requires spending much less time in Excel, which is a plus for most.)

The “bad guys”48 will sometimes win. That’s just a dissatisfying part of life, no matter where you look across humanity. But what we can control is how much we suffer as a result. Detecting failures in security controls early can mean the difference between an unexploited vulnerability and having to announce a data breach to your customers. Resilience and chaos engineering embrace the reality that models will be incomplete, controls will fail, mitigations will be disabled—in other words, things will fail and continue failing as the world revolves and evolves. If we architect our systems to expect failure, proactively challenge our assumptions through experimentation, and incorporate what we learn as feedback into our strategy, we can more fully understand how our systems actually work and how to improve and best secure them.

Instead of seeking to stop failure from ever occurring, the goal in resilience and chaos engineering is to handle failure gracefully.49 Early detection of failure minimizes incident impact and also reduces post-incident cleanup costs. Engineers have learned that detecting service failures early—like plethoric latency on a payment API—reduces the cost of a fix, and security failure is no different.
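As a tiny illustration of that kind of early signal, the hypothetical sketch below checks a window of observed latencies against a service-level objective; the SLO value and function names are assumptions for illustration, not a recommended monitoring design:

from statistics import quantiles

P99_SLO_SECONDS = 0.3  # hypothetical latency objective for the payment API

def check_latency(latency_window):
    """Return an alert-worthy message when the observed p99 latency breaches the
    objective, so the failure is surfaced while it is still cheap to fix."""
    p99 = quantiles(latency_window, n=100)[-1]  # approximate 99th percentile
    if p99 > P99_SLO_SECONDS:
        return f"payment API p99 latency {p99:.3f}s exceeds the {P99_SLO_SECONDS}s objective"
    return None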

Thus we arrive at two core guiding principles of SCE:

  1. Expect security controls to fail and prepare accordingly.

  2. Do not attempt to completely avoid incidents, but instead embrace the ability to quickly and effectively respond to them.

Under the first principle, systems must be designed under the assumption that security controls will fail and that users will not immediately understand (or care about) the security implications of their actions.50 Under the second principle, as described by ecological economics scholar Peter Timmerman, resilience can be thought of as the building of “buffering capacity” into a system to continually strengthen its ability to cope in the future.51

It is essential to accept that compromise and mistakes will happen, and to maintain a focus on ensuring our systems can gracefully handle adverse events. Security must move away from defensive postures to resilient postures, letting go of the impossible standard of perfect prevention.

Myth: The Security of Each Component Adds Up to Resilience

Because resilience is an emergent property at the systems level, it can’t be measured by analyzing components. This is quite unfortunate since traditional cybersecurity is largely grounded in the component level, whether evaluating the security of components or protecting components. Tabulating vulnerabilities in individual components is seen as providing evidence of how secure the organization is against attacks. From a resilience engineering perspective, that notion is nonsense. If we connect a frontend receiving input to a database and verify each component is “secure” individually, we may miss the potential for SQL injection that arises from their interaction.
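A minimal sketch of that exact scenario, using Python’s built-in sqlite3 module (the users table and helper names are invented for illustration): each component behaves as designed, and the vulnerability lives entirely in how the frontend’s input reaches the database’s query.

import sqlite3

def find_user_unsafe(conn, username):
    # Each component "works": the frontend hands over a string, the database runs SQL.
    # The vulnerability emerges from their interaction -- input spliced into query text.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()  # username = "' OR '1'='1" returns every row

def find_user_parameterized(conn, username):
    # Same two components, different interaction: the input is passed as a bound
    # parameter, so the database never interprets it as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()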

Mitigation and protection at the component level is partially due to this myth. Another belief, which practitioners are more reluctant to acknowledge, is that addressing security issues one by one, component by component, is simply more convenient than working on the larger picture of systems security. Gravitating toward work that feels easier and justifying it with a more profound impetus (like this myth) is a natural human tendency. To wit, we see the same focus on process redesign and component-level safety engineering in healthcare too.52 The dearth of knowledge and understanding about how complex systems work spans industries.

The good news is we don’t really need precise measurement to assess resilience (as we’ll detail in Chapter 2). Evaluating both brittleness and resilience comes from observing the system in both adverse and healthy scenarios. This is the beauty of SCE: by conducting specific experiments, you can test hypotheses about your system’s resilience to different types of adverse conditions and observe how your systems respond to them. Like tiles in a mosaic, each experiment creates a richer picture of your system to help you understand it better. As we’ll explore in Chapter 7, we can even approach security as a product, applying the scientific rigor and experimentation that helps us achieve better product outcomes.

Domains outside software don’t have this luxury. We don’t want to inject cyclones onto the Great Barrier Reef as an experiment to evaluate its resilience, nor would we want to inject adverse conditions into a national economy, a human body on the operating table, an urban water system, an airplane mid-flight, or really any other real, live system on which humans depend. Since (as far as we know) computers aren’t sentient, and as long as we gain consent from human stakeholders in our computer systems, this domain possesses a huge advantage relative to other domains in understanding systems resilience because we can conduct security chaos experiments.

Myth: Creating a “Security Culture” Fixes Human Error

The cybersecurity industry thinks and talks a lot about “security culture,” but this term means different things depending on whom you ask. Infosec isn’t alone in this focus; other domains, especially healthcare,53 have paid a lot of attention to fostering a “safety culture.” But at the bedrock of this focus on “culture”—no matter the industry—is its bellicose insistence that the humans intertwined with systems must focus more on security (or “safety” in other domains). In cybersecurity, especially, this is an attempt to distribute the burden of security to the rest of the organization—including accountability for incidents. Hence, we see an obsession with preventing users from clicking on things, despite the need in their work to click on many things many times a day. One might characterize the cynosure of infosec “security culture” as preventing people from clicking things on the thing-clicking machine—the modern monomania of subduing the internet era’s indomitable white whale.

Discussions about culture have little impact unless they are grounded in the reality of the dynamic, complex systems in which humans operate. It is easy to suggest that all would be ameliorated if humans simply paid more attention to security concerns in the course of their work, but such recommendations are unlikely to stick. More fruitful is understanding why security concerns are overlooked—whether because of competing priorities, production pressures, attention pulled in multiple directions, confusing alert messages, or something else entirely. And which work do we mean? Paying attention to security means something quite different to a procurement professional, who frequently interacts with external parties, than it does to a developer building APIs that operate in production environments.

Any fruitful discussion in this vein is often stifled by poorly chosen health metrics and other “security vanity” metrics. They don’t provide a full picture of organizational security, but rather lure practitioners into thinking they understand because quantification feels like real science—just as shamanism, humoralism, and astrology felt in prior eras. The percentage of users clicking on phishing links in your organization does not tell you whether your organization will experience a horrific incident if a user ends up clicking something they shouldn’t. If the number of vulnerabilities you find goes up, is that good or bad? Perhaps the number of vulnerabilities found in applications matters less than whether those vulnerabilities are being found earlier in the software delivery lifecycle, whether they are being remediated more quickly, or, even better, whether the production environment is architected so that their exploitation doesn’t result in any material impact. Needless to say, a metric like the percent of “risk coverage” (the goal being 100% coverage of this phantasmal ectoplasm) is little more than filler to feed to executives and boards of directors who lack technical expertise.
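
If you want a small, concrete example of what a more outcome-oriented measurement might look like, here is a hypothetical Python sketch (the findings data is made up for illustration) that reports the median time from detection to remediation rather than a raw vulnerability count:

    from datetime import date
    from statistics import median

    # Illustrative records only: when each finding was detected and remediated.
    findings = [
        {"detected": date(2023, 1, 3),  "remediated": date(2023, 1, 10)},
        {"detected": date(2023, 1, 15), "remediated": date(2023, 2, 1)},
        {"detected": date(2023, 2, 2),  "remediated": date(2023, 2, 6)},
    ]

    # Median days from detection to remediation: a signal about outcomes
    # (are we getting faster?) rather than raw volume (how many did we find?).
    days_to_remediate = [(f["remediated"] - f["detected"]).days for f in findings]
    print("median days to remediate:", median(days_to_remediate))

Even this is only one tile in the mosaic; the point is to measure what changes outcomes, not what is easiest to count.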

The other distressing byproduct of traditional attempts to foster a “security culture” is the focus on punishing humans or treating them as enemies labeled “insider threats.” Status quo security programs may buy tools that use machine learning and natural language processing to flag which employees sound sad or angry, hoping to detect “insider threats” early.54 Or security teams may install spyware55 on employee equipment to detect whether employees are looking for other jobs or exhibit other arbitrary symptoms of being upset with the organization. In the security status quo, we would rather treat humans like Schrödinger’s attacker than dissect the organizational factors that could lead to this form of vulnerability in the first place. This may placate those in charge, but it represents our failure in fostering organizational security.

Cybersecurity isn’t the only problem domain displaying this tendency. Yale sociologist Charles Perrow observed across complex systems that:

Virtually every system we will examine places “operator error” high on its list of causal factors — generally about 60 to 80 percent of accidents are attributed to this factor. But if, as we shall see time and time again, the operator is confronted by unexpected and usually mysterious interactions among failures, saying that he or she should have zigged instead of zagged is possible only after the fact.56

Remarkably, the cybersecurity industry ascribes about the same proportion of failures to “human error” too. The precise contribution of purported “human error” to data breaches depends on which source you consult: the 2022 Verizon Data Breach Investigations Report says that 82% of breaches “involved a human element,” while another study from 2021 reported human error as the cause of 88% of data breaches. Of the breaches reported to the Information Commissioner’s Office (ICO) between 2017 and 2018, 90% cited human error as the cause too. But if you dig into what those human errors were, they often represent actions that are entirely benign in other contexts, like clicking links, pushing code, updating configurations, sharing data, or entering credentials into login pages.

SCE emphasizes the importance of outside perspectives. We should consider our adversaries’ perspectives (as we’ll discuss in Chapter 2) and conduct user research (as we’ll discuss in Chapter 7), developing an understanding of the perspectives of those who are building, maintaining, and using systems so that the security program is not based on a fantasy. Adopting a systems perspective is the first step in better coping with security failure, as it allows you to see how a combination of chronic stressors and acute stressors leads to failure.

With SCE, the security program can provide immense value to its organization by narrowing the gap between work-as-imagined and work-as-performed.57 Systems will encroach upon their safety thresholds when policies and procedures are designed based on ideal operational behaviors (by humans and machines alike). Expecting humans to perform multiple steps sequentially without any reminders or visual aids is a recipe for mistakes and omissions.

Security programs in SCE are also curious about workarounds rather than forbidding them. Workarounds can actually support resilience. Humans can be quite adept at responding to competing pressures and goals, creating workarounds that allow the system to sustain performance of its critical functions. When workarounds are eliminated, it’s harder for the humans interacting with the system to use a variety of strategies in response to the variety of behaviors they encounter during their work. And that erodes resilience. In contrast with traditional infosec wisdom, SCE sees workarounds for what they are: adaptations in response to evolving conditions that are natural in complex systems.58 Workarounds are worth understanding when devising the right procedures for humans to operate flexibly and safely.

Chapter Takeaways

  • All of our software systems are complex. Complex systems are filled with variety, are adaptive, and are holistic in nature.

  • Failure is when systems—or components within systems—do not operate as intended. In complex systems, failure is inevitable and happening all the time. What matters is how we prepare for it.

  • Failure is never the result of one factor; there are multiple influencing factors working in concert. Acute and chronic stressors are factors, as are computer and human surprises.

  • Resilience is the ability of a system to adapt its functioning gracefully in response to changing conditions so it can continue thriving.

  • Resilience is the foundation of security chaos engineering. Security Chaos Engineering (SCE) is a set of principles and practices that help you design, build, and operate complex systems that are more resilient to attack.

  • The five ingredients of the “resilience potion” include understanding a system’s critical functionality; understanding its safety boundaries; observing interactions between its components across space and time; fostering feedback loops and a learning culture; and maintaining flexibility and openness to change.

  • Resilience is a verb. Security, as a subset of resilience, is something a system does, not something a system has.

  • SCE recognizes that a resilient system is one that performs as needed under a variety of conditions and can respond appropriately both to disturbances—like threats—as well as opportunities. Security programs are meant to help the organization anticipate new types of hazards as well as opportunities to innovate to be even better prepared for the next incident.

  • There are many myths about resilience, four of which we covered: that resilience is the same as robustness, the ability to “bounce back” to normal after an attack; that we can and should prevent failure (which is impossible); that the security of each component adds up to the security of the whole system; and that creating a “security culture” fixes the “human error” problem.

  • SCE embraces the idea that failure is inevitable and uses it as a learning opportunity. Rather than preventing failure, we must prioritize handling failure gracefully—which better aligns with organizational goals too.

1 George V. Neville-Neil, “Securing the Company Jewels,” Communications of the ACM 65, no. 10 (2022): 25-26.

2 Richard J. Holden, “People or Systems? To Blame Is Human. The Fix Is to Engineer,” Professional Safety 54, no. 12 (2009): 34.

3 David D. Woods, “Engineering Organizational Resilience to Enhance Safety: A Progress Report on the Emerging Field of Resilience Engineering,” Proceedings of the Human Factors and Ergonomics Society Annual Meeting 50, no. 19 (October 2006): 2237-2241.

4 Dirk G. Baur, “Financial Contagion and the Real Economy,” Journal of Banking & Finance 36, no. 10 (2012): 2680-2692; Kristin Forbes, “The ‘Big C’: Identifying Contagion,” National Bureau of Economic Research, Working Paper 18465 (2012).

5 Iacopo Iacopini et al., “Simplicial Models of Social Contagion,” Nature Communications 10, no. 1 (2019): 2485; Sinan Aral and Christos Nicolaides, “Exercise Contagion in a Global Social Network,” Nature Communications 8, no. 1 (2017): 14753.

6 Steven Sanche et al., “High Contagiousness and Rapid Spread of Severe Acute Respiratory Syndrome Coronavirus 2,” Emerging Infectious Diseases 26, no. 7 (2020): 1470.

7 A modern Diogenes might hold up Mr. Potato Head and declare, “Behold! A linear system!”

8 Amy Rankin et al., “Resilience in Everyday Operations: A Framework for Analyzing Adaptations in High-Risk Work,” Journal of Cognitive Engineering and Decision Making 8, no. 1 (2014): 78-97.

9 Rankin, “Resilience in Everyday Operations,” 78-97.

10 Joonhong Ahn et al., eds., Reflections on the Fukushima Daiichi Nuclear Accident: Toward Social-Scientific Literacy and Engineering Resilience (Berlin: Springer Nature, 2015).

11 Nick Chater and George Loewenstein, “The Under-Appreciated Drive for Sense-Making,” Journal of Economic Behavior & Organization 126 (2016): 137-154.

12 Institute of Medicine, Hospital-Based Emergency Care: At the Breaking Point (Washington, DC: The National Academies Press, 2007).

13 Christopher Nemeth et al., “Minding the Gaps: Creating Resilience in Health Care,” in Advances in Patient Safety: New Directions and Alternative Approaches, Vol. 3: Performance and Tools (Rockville, MD: Agency for Healthcare Research and Quality, August 2008).

14 Len Fisher, “To Build Resilience, Study Complex Systems,” Nature 595, no. 7867 (2021): 352.

15 Nancy G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (Cambridge, MA: MIT Press, 2016).

16 Pertinent domains include disaster management (e.g., flood resilience), climate change (e.g., agriculture, coral reef management), and safety-critical industries like aviation and medicine.

17 Lynn Margulis and Dorion Sagan, What Is Life? (Berkeley and Los Angeles, CA: University of California Press, 2000).

18 As just one example, Facebook published a study of memory errors at scale and found that 9.62% of servers experienced correctable memory errors (cumulatively across all months)—leading them to “corroborate the trend that memory errors are still a widespread problem in the field” (emphasis theirs).

19 Jane Jacobs, The Death and Life of Great American Cities (New York: Vintage Books, 1992).

20 Susan L. Cutter et al., “Disaster Resilience: A National Imperative,” Environment: Science and Policy for Sustainable Development 55, no. 2 (2013): 25-29.

21 Elizabeth B. Connelly et al., “Features of Resilience,” Environment Systems and Decisions 37, no. 1 (2017): 46-50.

22 Jens Rasmussen, “Risk Management in a Dynamic Society: A Modelling Problem,” Safety Science 27, nos. 2-3 (1997): 183-213.

23 K. J. Willis et al., “Biodiversity Baselines, Thresholds and Resilience: Testing Predictions and Assumptions Using Palaeoecological Data,” Trends in Ecology & Evolution 25, no. 10 (2010): 583-591.

24 Didier L. Baho et al., “A Quantitative Framework for Assessing Ecological Resilience,” Ecology & Society 22, no. 3 (2017): 17.

25 This is known as the “collapse” or “confusion” part of the adaptive cycle. Brian D. Fath et al., “Navigating the Adaptive Cycle: An Approach to Managing the Resilience of Social Systems,” Ecology & Society 20, no. 2 (2015): 24.

26 The book delves surprisingly deep into nonlinear systems theory, while the movie only scratches the surface.

27 Arie Staal et al., “Hysteresis of Tropical Forests in the 21st Century,” Nature Communications 11, no. 4978 (2020).

28 Craig R. Allen et al., “Quantifying Spatial Resilience,” Journal of Applied Ecology 53, no. 3 (2016): 625-635.

29 Carlo Rovelli, “The Disappearance of Space and Time,” in Philosophy and Foundations of Physics: The Ontology of Spacetime, 25-36 (Amsterdam and Oxford, UK: Elsevier, 2006).

30 Terry P. Hughes et al., “Ecological Memory Modifies the Cumulative Impact of Recurrent Climate Extremes,” Nature Climate Change 9, no. 1 (2019): 40-43.

31 Jill F. Johnstone et al., “Changing Disturbance Regimes, Ecological Memory, and Forest Resilience,” Frontiers in Ecology and the Environment 14, no. 7 (2016): 369-378.

32 Fath, “Navigating the Adaptive Cycle.”

33 John Ljungkvist et al., “The Urban Anthropocene: Lessons for Sustainability from the Environmental History of Constantinople,” in The Urban Mind: Cultural and Environmental Dynamics, 367-390 (Uppsala, Sweden: Uppsala University Press, 2010).

34 Nick Brooks et al., “Assessing and Enhancing Adaptive Capacity,” in Adaptation Policy Frameworks for Climate Change: Developing Strategies, Policies and Measures, 165-181 (New York and Cambridge, UK: UNDP and Cambridge University Press, 2005).

35 Carl Folke et al., “Resilience and Sustainable Development: Building Adaptive Capacity in a World of Transformations,” AMBIO: A Journal of the Human Environment 31, no. 5 (2002): 437-440.

36 Shana M. Sundstrom and Craig R. Allen, “The Adaptive Cycle: More Than a Metaphor,” Ecological Complexity 39 (2019): 100767.

37 Woods, “Engineering Organizational Resilience to Enhance Safety,” 2237-2241.

38 Nemeth, “Minding the Gaps.”

39 J. Park et al., “Integrating Risk and Resilience Approaches to Catastrophe Management in Engineering Systems,” Risk Analysis 33, no. 3 (2013): 356-366.

40 Connelly, “Features of Resilience,” 46-50.

41 Richard Cook and Jens Rasmussen, “Going Solid: A Model of System Dynamics and Consequences for Patient Safety,” BMJ Quality & Safety 14, no. 2 (2005): 130-134.

42 Alan Lightman, Probable Impossibilities: Musings on Beginnings and Endings (New York: Pantheon Books, 2021).

43 Rankin, “Resilience in Everyday Operations,” 78-97.

44 Adriana X. Sanchez et al., “Are Some Forms of Resilience More Sustainable Than Others?” Procedia Engineering 180 (2017): 881-889.

45 This is known as the “safe development paradox”: the anticipated safety gained by introducing a technical solution to a problem instead facilitates risk accumulation over time, leading to larger potential damage in the event of an incident. See Raymond J. Burby, “Hurricane Katrina and the Paradoxes of Government Disaster Policy: Bringing About Wise Governmental Decisions for Hazardous Areas,” The ANNALS of the American Academy of Political and Social Science 604, no. 1 (2006): 171-191.

46 Caroline Wenger, “The Oak or the Reed: How Resilience Theories Are Translated into Disaster Management Policies,” Ecology and Society 22, no. 3 (2017): 18.

47 Susan Elizabeth Hough, Predicting the Unpredictable, reprint ed. (Princeton, NJ: Princeton University Press, 2016).

48 We will not use the term bad guys or bad actors throughout this book for a variety of reasons, including the infantile worldview it confesses. Nevertheless, you are likely to encounter it in typical infosec discourse as an attempt at invective against attackers broadly.

49 See Bill Hoffman’s tenets of operations-friendly services, by way of James R. Hamilton, “On Designing and Deploying Internet-Scale Services,” LISA 18 (November 2007): 1-18.

50 “End users” and “system admins” are continually featured as “top actors” involved in data breaches in the annual editions of the Verizon Data Breach Investigations Report (DBIR) (2020 Report).

51 Peter Timmerman, “Vulnerability, Resilience and the Collapse of Society,” Environmental Monograph No. 1 (1981): 1-42.

52 Nemeth, “Minding the Gaps.”

53 Cook, “Going Solid,” 130-134.

54 There are a few examples of this sort of sentiment analysis to detect “insider threats,” one of which is: https://oreil.ly/qTA76. See CMU’s blog post for the challenges associated with it in practice.

55 Usually vendors won’t say “keylogging” explicitly, but will use euphemisms like “keystroke dynamics,” “keystroke logging,” or “user behavior analytics.” As CISA explains in their Insider Threat Mitigation Guide about User Activity Monitoring (UAM), “In general, UAM software monitors the full range of a user’s cyber behavior. It can log keystrokes, capture screenshots, make video recordings of sessions, inspect network packets, monitor kernels, track web browsing and searching, record the uploading and downloading of files, and monitor system logs for a comprehensive picture of activity across the network.” See also the presentation “Exploring keystroke dynamics for insider threat detection”.

56 Charles Perrow, Normal Accidents: Living with High-Risk Technologies, Revised Edition (Princeton, NJ: Princeton University Press, 1999).

57 Manikam Pillay and Gaël Morel, “Measuring Resilience Engineering: An Integrative Review and Framework for Benchmarking Organisational Safety,” Safety 6, no. 3 (2020): 37.

58 Rankin, “Resilience in Everyday Operations,” 78-97.
