Security Chaos Engineering

Book description

Cybersecurity is broken. Year after year, attackers remain unchallenged and undeterred, while engineering teams feel pressure to design, build, and operate "secure" systems. Failure can't be prevented, mental models of systems are incomplete, and our digital world constantly evolves. How can we verify that our systems behave the way we expect? What can we do to improve our systems' resilience?

In this comprehensive guide, authors Kelly Shortridge and Aaron Rinehart help you navigate the challenges of sustaining resilience in complex software systems by using the principles and practices of security chaos engineering. By preparing for adverse events, you can ensure they don't disrupt your ability to innovate, move quickly, and achieve your engineering and business goals.

  • Learn how to design a modern security program
  • Make informed decisions at each phase of software delivery to nurture resilience and adaptive capacity
  • Understand the complex systems dynamics upon which resilience outcomes depend
  • Navigate technical and organizational trade-offs that distort decision making in systems
  • Explore chaos experimentation to verify critical assumptions about software quality and security
  • Learn how major enterprises leverage security chaos engineering

Table of contents

  1. Preface
    1. Who Should Read This Book?
    2. Scope of This Book
      1. Outline of This Book
    3. Conventions Used in This Book
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Resilience in Software and Systems
    1. What Is a Complex System?
      1. Variety Defines Complex Systems
      2. Complex Systems Are Adaptive
      3. The Holistic Nature of Complex Systems
    2. What Is Failure?
      1. Acute and Chronic Stressors in Complex Systems
      2. Surprises in Complex Systems
    3. What Is Resilience?
      1. Critical Functionality
      2. Safety Boundaries (Thresholds)
      3. Interactions Across Space-Time
      4. Feedback Loops and Learning Culture
      5. Flexibility and Openness to Change
    4. Resilience Is a Verb
    5. Resilience: Myth Versus Reality
      1. Myth: Robustness = Resilience
      2. Myth: We Can and Should Prevent Failure
      3. Myth: The Security of Each Component Adds Up to Resilience
      4. Myth: Creating a “Security Culture” Fixes Human Error
    6. Chapter Takeaways
  3. 2. Systems-Oriented Security
    1. Mental Models of System Behavior
      1. How Attackers Exploit Our Mental Models
      2. Refining Our Mental Models
    2. Resilience Stress Testing
    3. The E&E Resilience Assessment Approach
    4. Evaluation: Tier 1 Assessment
      1. Mapping Flows to Critical Functionality
      2. Document Assumptions About Safety Boundaries
      3. Making Attacker Math Work for You
      4. Starting the Feedback Flywheel with Decision Trees
      5. Moving Toward Tier 2: Experimentation
    5. Experimentation: Tier 2 Assessment
      1. The Value of Experimental Evidence
      2. Sustaining Resilience Assessments
    6. Fail-Safe Versus Safe-to-Fail
      1. Uncertainty Versus Ambiguity
      2. Fail-Safe Neglects the Systems Perspective
      3. The Fragmented World of Fail-Safe
    7. SCE Versus Security Theater
      1. What Is Security Theater?
      2. How Does SCE Differ from Security Theater?
    8. How to RAVE Your Way to Resilience
      1. Repeatability: Handling Complexity
      2. Accessibility: Making Security Easier for Engineers
      3. Variability: Supporting Evolution
    9. Chapter Takeaways
  4. 3. Architecting and Designing
    1. The Effort Investment Portfolio
      1. Allocating Your Effort Investment Portfolio
      2. Investing Effort Based on Local Context
    2. The Four Failure Modes Resulting from System Design
    3. The Two Key Axes of Resilient Design: Coupling and Complexity
      1. Designing to Preserve Possibilities
    4. Coupling in Complex Systems
      1. The Tight Coupling Trade-Off
      2. The Dangers of Tight Coupling: Taming the Forest
      3. Investing in Loose Coupling in Software Systems
      4. Chaos Experiments Expose Coupling
    5. Complexity in Complex Systems
      1. Understanding Complexity: Essential and Accidental
      2. Complexity and Mental Models
      3. Introducing Linearity into Our Systems
      4. Designing for Interactivity: Identity and Access Management
      5. Navigating Flawed Mental Models
    6. Chapter Takeaways
  5. 4. Building and Delivering
    1. Mental Models When Developing Software
    2. Who Owns Application Security (and Resilience)?
      1. Lessons We Can Learn from Database Administration Going DevOps
    3. Decisions on Critical Functionality Before Building
      1. Defining System Goals and Guidelines on “What to Throw Out the Airlock”
      2. Code Reviews and Mental Models
      3. “Boring” Technology Is Resilient Technology
      4. Standardization of Raw Materials
    4. Developing and Delivering to Expand Safety Boundaries
      1. Anticipating Scale and SLOs
      2. Automating Security Checks via CI/CD
      3. Standardization of Patterns and Tools
      4. Dependency Analysis and Prioritizing Vulnerabilities
    5. Observe System Interactions Across Space-Time (or Make More Linear)
      1. Configuration as Code
      2. Fault Injection During Development
      3. Integration Tests, Load Tests, and Test Theater
      4. Beware Premature and Improper Abstractions
    6. Fostering Feedback Loops and Learning During Build and Deliver
      1. Test Automation
      2. Documenting Why and When
      3. Distributed Tracing and Logging
      4. Refining How Humans Interact with Build and Delivery Practices
    7. Flexibility and Willingness to Change
      1. Iteration to Mimic Evolution
      2. Modularity: Humanity’s Ancient Tool for Resilience
      3. Feature Flags and Dark Launches
      4. Preserving Possibilities for Refactoring: Typing
      5. The Strangler Fig Pattern
    8. Chapter Takeaways
  6. 5. Operating and Observing
    1. What Does Operating and Observing Involve?
    2. Operational Goals in SCE
      1. The Overlap of SRE and Security
      2. Measuring Operational Success
      3. Crafting Success Metrics like Attackers
      4. The DORA Metrics
      5. SLOs, SLAs, and Principled Performance Analytics
      6. Embracing Confidence-Based Security
    3. Observability for Resilience and Security
      1. Thresholding to Uncover Safety Boundaries
      2. Attack Observability
    4. Scalable Is Safer
      1. Navigating Scalability
      2. Automating Away Toil
    5. Chapter Takeaways
  7. 6. Responding and Recovering
    1. Responding to Surprises in Complex Systems
      1. Incident Response and the Effort Investment Portfolio
      2. Action Bias in Incident Response
      3. Practicing Response Activities
    2. Recovering from Surprises
      1. Blameless Culture
      2. Blaming Human Error
      3. Hindsight Bias and Outcome Bias
      4. The Just-World Hypothesis
      5. Neutral Practitioner Questions
    3. Chapter Takeaways
  8. 7. Platform Resilience Engineering
    1. Production Pressures and How They Influence System Behavior
    2. What Is Platform Engineering?
    3. Defining a Vision
    4. Defining a User Problem
      1. Local Context Is Critical
      2. User Personas, Stories, and Journeys
      3. Understanding How Humans Make Trade-Offs Under Pressure
    5. Designing a Solution
      1. The Ice Cream Cone Hierarchy of Security Solutions
      2. System Design and Redesign to Eliminate Hazards
      3. Substitute Less Hazardous Methods or Materials
      4. Incorporate Safety Devices and Guards
      5. Provide Warning and Awareness Systems
      6. Apply Administrative Controls Including Guidelines and Training
      7. Two Paths: The Control Strategy or the Resilience Strategy
      8. Experimentation and Feedback Loops for Solution Design
    6. Implementing a Solution
      1. Fostering Consensus
      2. Planning for Migration
      3. Success Metrics
    7. Chapter Takeaways
  9. 8. Security Chaos Experiments
    1. Lessons Learned from Early Adopters
      1. Lesson #1. Start in Nonproduction Environments; You Can Still Learn a Lot
      2. Lesson #2. Use Past Incidents as a Source of Experiments
      3. Lesson #3. Publish and Evangelize Experimental Findings
    2. Setting Experiments Up for Success
    3. Designing a Hypothesis
    4. Designing an Experiment
    5. Experiment Design Specifications
    6. Conducting Experiments
      1. Collecting Evidence
    7. Analyzing and Documenting Evidence
      1. Capturing Knowledge for Feedback Loops
      2. Document Experiment Release Notes
    8. Automating Experiments
    9. Easing into Chaos: Game Days
    10. Example Security Chaos Experiments
      1. Security Chaos Experiments for Production Infrastructure
      2. Security Chaos Experiments for Build Pipelines
      3. Security Chaos Experiments in Cloud Native Environments
      4. Security Chaos Experiments in Windows Environments
    11. Chapter Takeaways
  10. 9. Security Chaos Engineering in the Wild
    1. Experience Report: The Existence of Order Through Chaos (UnitedHealth Group)
      1. The Story of ChaoSlingr
      2. Step-by-Step Example: PortSlingr
    2. Experience Report: A Quest for Stronger Reliability (Verizon)
      1. The Bigger They Are…
      2. All Hands on Deck Means No Hands on the Helm
      3. Assert Your Hypothesis
      4. Reliability Experiments
      5. Cost Experiments
      6. Performance Experiments
      7. Risk Experiments
      8. More Traditionally Known Experiments
      9. Changing the Paradigm to Continuous
      10. Lessons Learned
    3. Experience Report: Security Monitoring (OpenDoor)
    4. Experience Report: Applied Security (Cardinal Health)
      1. Building the SCE Culture
      2. The Mission of Applied Security
      3. The Method: Continuous Verification and Validation (CVV)
      4. The CVV Process Includes Four Steps
    5. Experience Report: Balancing Reliability and Security via SCE (Accenture Global)
      1. Our Roadmap to SCE Enterprise Capability
      2. Our Process for Adoption
    6. Experience Report: Cyber Chaos Engineering (Capital One)
      1. What Does All This Have to Do with SCE?
      2. What Is Secure Today May Not Be Secure Tomorrow
      3. How We Started
      4. How We Did This in Ye Olden Days
      5. Things I’ve Learned Along the Way
      6. A Reduction of Guesswork
      7. Driving Value
      8. Conclusion
    7. Chapter Takeaways
  11. Index
  12. About the Authors

Product information

  • Title: Security Chaos Engineering
  • Author(s): Kelly Shortridge, Aaron Rinehart
  • Release date: March 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098113827