Engineering Resilient Systems on AWS

Book description

More businesses today are moving applications to AWS to make them reliable and always available. But many companies still struggle to design and build these cloud applications effectively, assuming that because the cloud is resilient, their applications will be too. With this practical guide, software, DevOps, and cloud engineers will learn how to implement resilient designs and configurations in the cloud through independent, hands-on labs.

Authors Kevin Schwarz, Jennifer Moran, and Dr. Nate Bachmeier from AWS teach you how to build resilient cloud applications using patterns such as backoff and retry, multi-Region failover, data protection, and circuit breaker, applied to common configuration, tooling, and deployment scenarios. Labs are organized into categories by complexity and topic, making it easy to focus on the parts most relevant to your business.

You'll learn how to:

  • Configure and deploy AWS services using resilience patterns
  • Implement stateless microservices for high availability
  • Consider multi-Region designs to meet business requirements
  • Implement backup and restore, pilot light, warm standby, and active-active strategies
  • Build applications that withstand AWS Region and Availability Zone impairments
  • Use chaos engineering experiments for fault injection to test for resilience
  • Assess the trade-offs when building resilient systems, including cost, complexity, and operational burden

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
      1. From Kevin
      2. From Jennifer
      3. From Dr. Nate
  2. I. Foundations
  3. 1. Introduction
    1. People, Processes, and Technology
      1. The Role of People
      2. The Role of Processes
      3. Integrating People, Processes, and Technology
    2. Shared Responsibility Model
    3. AWS Responsibility
    4. Customer Responsibility
      1. Setting Objectives
      2. Workload Architecture
      3. Networking
      4. Quotas
      5. Change Management
      6. Failure Management
      7. Observability
      8. Continuous Testing and Chaos Engineering
      9. CI/CD and Automation
      10. Continuous Resilience
    5. Summary
  4. 2. Prepare Your Working Environment
    1. Hands-on Learning with Microservices
    2. AWS Account and Permissions
    3. Choosing a Development OS and IDE
    4. Git and Code Samples Repository
    5. Python Environment
    6. NPM and Node.js
    7. AWS CDK
    8. Additional Software
      1. AWS CLI
      2. Python Packages
      3. Vue.js and Vite
      4. Bootstrap CSS
      5. Artillery.io
      6. curl and watch
      7. Boto3
      8. PostgreSQL
      9. Lambda Powertools
      10. Docker Desktop
    9. Custom Domain and Route 53 Hosted Zone
    10. Security
      1. Encryption in Transit
      2. Encryption at Rest
      3. Authentication and Authorization for API Endpoints
      4. Tokenization
      5. Code Scanning
    11. Cleaning Up
    12. Summary
  5. II. Reliable Trading Portal
  6. 3. Frontend Web Application
    1. Technical Requirements
    2. Architecture Overview
    3. Deploying the AWS CDK Application
      1. Using an Amazon CloudFront Domain
      2. Amazon CloudFront
      3. Amazon Simple Storage Service
      4. Amazon Route 53
    4. Implementing Observability
    5. Injecting Failure Modes
      1. Introducing Excessive Load
      2. Introducing Excessive Latency
      3. Addressing Single Points of Failure
    6. Cleaning Up
    7. Summary
  7. 4. Serverless Account Open API
    1. Technical Requirements
    2. Architecture Overview: An AWS Serverless Approach
    3. Deploying the AWS CDK Application
    4. Sunny Day Scenario
    5. Strongly Typed Service Contracts
    6. Idempotent Responses
    7. Self-Healing with Message Queue Retries
    8. Rate Limiting: Throttle Unanticipated Load
    9. Surviving a Poison Pill
    10. STOP: Business Continuity Regional Switchover
    11. Returning to Business as Usual
    12. Blue-Green Testing
    13. Cleaning Up
    14. Summary
  8. 5. Containerized Trade Stock API
    1. Technical Requirements
    2. Architecture Overview
    3. Deploying the AWS CDK Application
      1. VpcStack
      2. TradeDatabaseStack
      3. TradeOrderStack
      4. TradeConfirmsStack
      5. Prepare the Database
    4. Container Deployment Failures
    5. Database Connection Exhaustion
    6. Database Password Rotation Login Failures
    7. Database Primary Writer Failures
    8. Dependency Intermittent Failures
    9. Detecting and Handling Availability Zone Issues
    10. Dependency Outages
    11. Cleaning Up
    12. Summary
  9. 6. Integrated AvailableTrade Frontend with APIs
    1. Technical Requirements
    2. Architecture Overview
    3. Deploying the AWS CDK Application
    4. Automating AvailableTrade Endpoint Configuration
    5. Integrating AvailableTrade Microservices
    6. Configuring Client Timeouts
    7. Gracefully Degrading Features
    8. Real User Monitoring
    9. X-Ray for End-to-End Tracing
    10. Cleaning Up
    11. Summary
  10. 7. When Recovery Is Required
    1. Architecture Overview
    2. Deploying the AWS CDK Application
      1. Deploy the AWS CDK Trade Stock Stack in Secondary Region
      2. Deploying the AWS CDK Orchestration Stack
      3. Integrating Backend API to Frontend
    3. Validating the Region
    4. Database Failover and Switchover
      1. Failover
      2. Switchover
    5. Scaling Compute
    6. Routing at the Lambda Layer
    7. DNS Failover
    8. Importance of Backups
    9. Avoiding Configuration Drift
    10. Failover Verification
    11. Cleaning Up
    12. Summary
  11. III. Discovering Trading Opportunities
  12. 8. Real-Time Market Data Analytics
    1. Technical Requirements
    2. Designing a Reliable Data Ingestion Layer
      1. Role of Apache Kafka in Data Ingestion
      2. Designing the Kafka Topic Structure
      3. Securing the Kafka Cluster
    3. Implementing Reliable Consumers
      1. Ensuring Fault Tolerance and Scalability
      2. Consumer Groups and Record Processing
      3. Handling Invalid Messages
      4. Dealing with Downstream Dependencies
    4. Integrating Consumers and APIs
      1. Creating the Connection
      2. Designing Consumer State
      3. Implementing State Management
      4. Handling Concurrency
      5. Using Restartability
    5. Storing and Querying Processed Market Data
      1. Handling Firehose Failure Modes
      2. Querying Athena
      3. Optimizing Data Storage and Querying Performance
    6. Monitoring and Observability
    7. Testing Resiliency
    8. Cleaning Up
    9. Summary
  13. 9. Building Reliable News Feed Ingestion and Search APIs
    1. Technical Requirements
    2. Fetching and Processing News Articles
      1. Producer-Consumer Pattern for Article Processing
      2. Leader Election for Scheduler High Availability
      3. Scheduler Configuration Failure Modes
      4. Additional Resiliency Strategies
    3. Syncing Articles to OpenSearch
    4. Serving Search Traffic
    5. Cleaning Up
    6. Summary
  14. 10. Building Resilient Multi-Region Architectures
    1. The Business Case for Multi-Region Architectures
    2. Multi-Region Database Architectures
      1. Understanding Consistency Models
      2. Replication Strategies
      3. Handling Conflict Resolution
    3. Multi-Region Streaming Architectures
      1. Replicating Kafka Data Across Regions
      2. Handling Active-Active Kafka Deployments
      3. Streaming Data to Other Destinations
    4. Multi-Region Search Architectures with OpenSearch
      1. Cross-Region Data Replication with OpenSearch
      2. Other Data Replication Options
    5. Caching in Multi-Region Architectures
    6. Summary
  15. 11. Putting It All Together
    1. Reviewing Core Concepts
      1. Reliability Frameworks
      2. Failure Modes with Reliability Patterns
      3. Connecting the Key Learnings
    2. Leading Resiliency Initiatives: Cultivating a Culture of Resilience
      1. Nurturing the Seeds of Resilience
      2. Becoming the Go-To Resilience Guru
      3. Sharpening Your Resilience Radar
      4. Embracing Continuous Resilience
      5. Making Resilience a Daily Habit
    3. Looking to the Future
      1. Navigating the Multicloud and Hybrid Cloud Landscape
      2. Harnessing AI for Resilience
      3. Embracing Chaos Engineering
      4. Leveraging Observability
    4. Summary
  16. Index
  17. About the Authors

Product information

  • Title: Engineering Resilient Systems on AWS
  • Author(s): Kevin Schwarz, Jennifer Moran, Nate Bachmeier
  • Release date: October 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098162429