Data Pipelines Pocket Reference

Book description

Data pipelines are the foundation for success in data analytics. Moving data from numerous, diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in the modern data stack.

You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:

  • What a data pipeline is and how it works
  • How data is moved and processed on modern data infrastructure, including cloud platforms
  • Common tools and products used by data engineers to build pipelines
  • How pipelines support analytics and reporting needs
  • Considerations for pipeline maintenance, testing, and alerting

Table of contents

  Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  1. Introduction to Data Pipelines
    1. What Are Data Pipelines?
    2. Who Builds Data Pipelines?
      1. SQL and Data Warehousing Fundamentals
      2. Python and/or Java
      3. Distributed Computing
      4. Basic System Administration
      5. A Goal-Oriented Mentality
    3. Why Build Data Pipelines?
    4. How Are Pipelines Built?
  2. A Modern Data Infrastructure
    1. Diversity of Data Sources
      1. Source System Ownership
      2. Ingestion Interface and Data Structure
      3. Data Volume
      4. Data Cleanliness and Validity
      5. Latency and Bandwidth of the Source System
    2. Cloud Data Warehouses and Data Lakes
    3. Data Ingestion Tools
    4. Data Transformation and Modeling Tools
    5. Workflow Orchestration Platforms
      1. Directed Acyclic Graphs
    6. Customizing Your Data Infrastructure
  3. Common Data Pipeline Patterns
    1. ETL and ELT
    2. The Emergence of ELT over ETL
    3. EtLT Subpattern
    4. ELT for Data Analysis
    5. ELT for Data Science
    6. ELT for Data Products and Machine Learning
      1. Steps in a Machine Learning Pipeline
      2. Incorporate Feedback in the Pipeline
      3. Further Reading on ML Pipelines
  4. Data Ingestion: Extracting Data
    1. Setting Up Your Python Environment
    2. Setting Up Cloud File Storage
    3. Extracting Data from a MySQL Database
      1. Full or Incremental MySQL Table Extraction
      2. Binary Log Replication of MySQL Data
    4. Extracting Data from a PostgreSQL Database
      1. Full or Incremental Postgres Table Extraction
      2. Replicating Data Using the Write-Ahead Log
    5. Extracting Data from MongoDB
    6. Extracting Data from a REST API
    7. Streaming Data Ingestions with Kafka and Debezium
  5. Data Ingestion: Loading Data
    1. Configuring an Amazon Redshift Warehouse as a Destination
    2. Loading Data into a Redshift Warehouse
      1. Incremental Versus Full Loads
      2. Loading Data Extracted from a CDC Log
    3. Configuring a Snowflake Warehouse as a Destination
    4. Loading Data into a Snowflake Data Warehouse
    5. Using Your File Storage as a Data Lake
    6. Open Source Frameworks
    7. Commercial Alternatives
  6. Transforming Data
    1. Noncontextual Transformations
      1. Deduplicating Records in a Table
      2. Parsing URLs
    2. When to Transform? During or After Ingestion?
    3. Data Modeling Foundations
      1. Key Data Modeling Terms
      2. Modeling Fully Refreshed Data
      3. Slowly Changing Dimensions for Fully Refreshed Data
      4. Modeling Incrementally Ingested Data
      5. Modeling Append-Only Data
      6. Modeling Change Capture Data
  7. Orchestrating Pipelines
    1. Directed Acyclic Graphs
    2. Apache Airflow Setup and Overview
      1. Installing and Configuring
      2. Airflow Database
      3. Web Server and UI
      4. Scheduler
      5. Executors
      6. Operators
    3. Building Airflow DAGs
      1. A Simple DAG
      2. An ELT Pipeline DAG
    4. Additional Pipeline Tasks
      1. Alerts and Notifications
      2. Data Validation Checks
    5. Advanced Orchestration Configurations
      1. Coupled Versus Uncoupled Pipeline Tasks
      2. When to Split Up DAGs
      3. Coordinating Multiple DAGs with Sensors
    6. Managed Airflow Options
    7. Other Orchestration Frameworks
  8. Data Validation in Pipelines
    1. Validate Early, Validate Often
      1. Source System Data Quality
      2. Data Ingestion Risks
      3. Enabling Data Analyst Validation
    2. A Simple Validation Framework
      1. Validator Framework Code
      2. Structure of a Validation Test
      3. Running a Validation Test
      4. Usage in an Airflow DAG
      5. When to Halt a Pipeline, When to Warn and Continue
      6. Extending the Framework
    3. Validation Test Examples
      1. Duplicate Records After Ingestion
      2. Unexpected Change in Row Count After Ingestion
      3. Metric Value Fluctuations
    4. Commercial and Open Source Data Validation Frameworks
  9. Best Practices for Maintaining Pipelines
    1. Handling Changes in Source Systems
      1. Introduce Abstraction
      2. Maintain Data Contracts
      3. Limits of Schema-on-Read
    2. Scaling Complexity
      1. Standardizing Data Ingestion
      2. Reuse of Data Model Logic
      3. Ensuring Dependency Integrity
  10. Measuring and Monitoring Pipeline Performance
    1. Key Pipeline Metrics
    2. Prepping the Data Warehouse
      1. A Data Infrastructure Schema
    3. Logging and Ingesting Performance Data
      1. Ingesting DAG Run History from Airflow
      2. Adding Logging to the Data Validator
    4. Transforming Performance Data
      1. DAG Success Rate
      2. DAG Runtime Change Over Time
      3. Validation Test Volume and Success Rate
    5. Orchestrating a Performance Pipeline
      1. The Performance DAG
    6. Performance Transparency
  Index

Product information

  • Title: Data Pipelines Pocket Reference
  • Author: James Densmore
  • Release date: February 2021
  • Publisher: O'Reilly Media, Inc.
  • ISBN: 9781492087830