Data Engineering Design Patterns

Book description

Data projects are an intrinsic part of an organization's technical ecosystem, but data engineers in many companies are still trying to solve problems that others have already solved. This hands-on guide shows you how to deliver valuable data by addressing key aspects of data engineering, including data ingestion, data quality, and idempotency.

Author Bartosz Konieczny guides you through the process of building reliable end-to-end data engineering projects, from data ingestion to data observability, focusing on data engineering design patterns that solve common business problems in a secure and storage-optimized manner. Each pattern includes a user-facing description of the problem, solutions, and consequences that place the pattern into the context of real-life scenarios.

Throughout this journey, you'll use open source data tools and public cloud services to see how to put each pattern into practice. You'll learn:

  • Challenges data engineers face and their impact on data systems
  • How these challenges relate to data system components
  • What data engineering patterns are for
  • How to identify and fix issues with your current data components
  • Technology-agnostic solutions for new and existing data projects
  • How to implement patterns with Apache Airflow, Apache Spark, Apache Flink, and Delta Lake

Bartosz Konieczny is a freelance data engineer who's been coding for more than 15 years. He's held various senior hands-on positions that helped him work on many data engineering problems in batch and stream processing.

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Introducing data engineering design patterns
    1. What are design patterns?
    2. Yet another design patterns?
    3. Common data engineering patterns
    4. Case study used in the book
    5. Summary
  3. 2. Data ingestion design patterns
    1. Full load
      1. Pattern: Full loader
    2. Incremental load
      1. Pattern: Incremental loader
      2. Pattern: Change Data Capture
    3. Replication
      1. Pattern: Passthrough replicator
      2. Pattern: Transformation replicator
    4. Data compaction
      1. Pattern: Compactor
    5. Data readiness
      1. Pattern: Readiness marker
    6. Event-driven
      1. Pattern: External trigger
    7. Summary
  4. 3. Error management design patterns
    1. Unprocessable records
      1. Pattern: Dead-Letter
    2. Duplicated records
      1. Pattern: Windowed deduplicator
    3. Late data
      1. Pattern: Late data detector
      2. Pattern: Sequential late data integrator
      3. Pattern: Concurrent late data integrator
    4. Filtering
      1. Pattern: Filter interceptor
    5. Fault-tolerance
      1. Pattern: Checkpointer
    6. Summary
  5. 4. Idempotency design patterns
    1. Overwriting
      1. Pattern: Fast metadata cleaner
      2. Pattern: Data overwrite
    2. Updates
      1. Pattern: Merger
    3. Database
      1. Pattern: Keyed idempotency
      2. Pattern: Transactional writer
    4. Immutable dataset
      1. Pattern: Proxy
    5. Summary
  6. 5. Data value design patterns
    1. Data enrichment
      1. Pattern: Static joiner
      2. Pattern: Dynamic joiner
    2. Data decoration
      1. Pattern: Wrapper
      2. Pattern: Metadata decorator
    3. Data aggregation
      1. Pattern: Distributed aggregator
      2. Pattern: Local aggregator
    4. Sessionization
      1. Pattern: Incremental sessionizer
      2. Pattern: Stateful sessionizer
    5. Data ordering
      1. Pattern: Bin pack orderer
      2. Pattern: FIFO orderer
    6. Summary
  7. 6. Data Flow design patterns
    1. Sequence
      1. Pattern: Local sequencer
      2. Pattern: Isolated sequencer
    2. Fan-in
      1. Pattern: Aligned fan-in
      2. Pattern: Unaligned fan-in
    3. Fan-out
      1. Pattern: Parallel split
      2. Pattern: Exclusive choice
    4. Orchestration
      1. Pattern: Single runner
      2. Pattern: Concurrent runner
    5. Summary
  8. About the Author

Product information

  • Title: Data Engineering Design Patterns
  • Author(s): Bartosz Konieczny
  • Release date: April 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098165819