Delta Lake: The Definitive Guide

Book description

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques.

Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale.

This book helps you:

  • Understand key data reliability challenges and how Delta Lake solves them
  • Explain the critical role of Delta transaction logs as a single source of truth
  • Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
  • Architect data lakehouses with the medallion architecture
  • Optimize Delta Lake performance with features like deletion vectors and liquid clustering

Table of contents

  1. Foreword by Michael Armbrust
  2. Foreword by Dominique Brezinski
  3. Preface
    1. Who This Book Is For
    2. How This Book Is Organized
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
      1. Denny
      2. Tristen
      3. Scott
      4. Prashanth
  4. 1. Introduction to the Delta Lake Lakehouse Format
    1. The Genesis of Delta Lake
      1. Data Warehousing, Data Lakes, and Data Lakehouses
      2. Project Tahoe to Delta Lake: The Early Months
    2. What Is Delta Lake?
      1. Common Use Cases
      2. Key Features
    3. Anatomy of a Delta Lake Table
    4. Delta Transaction Protocol
      1. Understanding the Delta Lake Transaction Log at the File Level
      2. The Single Source of Truth
      3. The Relationship Between Metadata and Data
      4. Multiversion Concurrency Control (MVCC) File and Data Observations
      5. Observing the Interaction Between the Metadata and Data
      6. Table Features
    5. Delta Kernel
    6. Delta UniForm
    7. Conclusion
  5. 2. Installing Delta Lake
    1. Delta Lake Docker Image
      1. Delta Lake for Python
      2. PySpark Shell
      3. JupyterLab Notebook
      4. Scala Shell
      5. Delta Rust API
      6. ROAPI
    2. Native Delta Lake Libraries
      1. Multiple Bindings Available
      2. Installing the Delta Lake Python Package
    3. Apache Spark with Delta Lake
      1. Setting Up Delta Lake with Apache Spark
      2. Prerequisite: Set Up Java
      3. Setting Up an Interactive Shell
    4. PySpark Declarative API
    5. Databricks Community Edition
      1. Create a Cluster with Databricks Runtime
      2. Importing Notebooks
      3. Attaching Notebooks
    6. Conclusion
  6. 3. Essential Delta Lake Operations
    1. Create
      1. Creating a Delta Lake Table
      2. Loading Data into a Delta Lake Table
      3. The Transaction Log
    2. Read
      1. Querying Data from a Delta Lake Table
      2. Reading with Time Travel
    3. Update
    4. Delete
      1. Deleting Data from a Delta Lake Table
      2. Overwriting Data in a Delta Lake Table
    5. Merge
    6. Other Useful Actions
      1. Parquet Conversions
      2. Delta Lake Metadata and History
    7. Conclusion
  7. 4. Diving into the Delta Lake Ecosystem
    1. Connectors
    2. Apache Flink
      1. Flink DataStream Connector
      2. Installing the Connector
      3. DeltaSource API
      4. DeltaSink API
      5. End-to-End Example
    3. Kafka Delta Ingest
      1. Install Rust
      2. Build the Project
      3. Run the Ingestion Flow
    4. Trino
      1. Getting Started
      2. Configuring and Using the Trino Connector
      3. Using Show Catalogs
      4. Creating a Schema
      5. Show Schemas
      6. Working with Tables
      7. Table Operations
    5. Conclusion
  8. 5. Maintaining Your Delta Lake
    1. Using Delta Lake Table Properties
      1. Delta Lake Table Properties Reference
      2. Create an Empty Table with Properties
      3. Populate the Table
      4. Evolve the Table Schema
      5. Add or Modify Table Properties
      6. Remove Table Properties
    2. Delta Lake Table Optimization
      1. The Problem with Big Tables and Small Files
      2. Using OPTIMIZE to Fix the Small File Problem
    3. Table Tuning and Management
      1. Partitioning Your Tables
      2. Defining Partitions on Table Creation
      3. Migrating from a Nonpartitioned to a Partitioned Table
    4. Repairing, Restoring, and Replacing Table Data
      1. Recovering and Replacing Tables
      2. Deleting Data and Removing Partitions
      3. The Life Cycle of a Delta Lake Table
      4. Restoring Your Table
      5. Cleaning Up
    5. Conclusion
  9. 6. Building Native Applications with Delta Lake
    1. Getting Started
      1. Python
      2. Rust
      3. Building a Lambda
    2. What’s Next
  10. 7. Streaming In and Out of Your Delta Lake
    1. Streaming and Delta Lake
      1. Streaming Versus Batch Processing
      2. Delta as Source
      3. Delta as Sink
    2. Delta Streaming Options
      1. Limit the Input Rate
      2. Ignore Updates or Deletes
      3. Initial Processing Position
      4. Initial Snapshot with withEventTimeOrder
    3. Advanced Usage with Apache Spark
      1. Idempotent Stream Writes
      2. Delta Lake Performance Metrics
    4. Auto Loader and Delta Live Tables
      1. Auto Loader
      2. Delta Live Tables
    5. Change Data Feed
      1. Using Change Data Feed
      2. Schema
    6. Conclusion
  11. 8. Advanced Features
    1. Generated Columns, Keys, and IDs
    2. Comments and Constraints
      1. Comments
      2. Delta Table Constraints
    3. Deletion Vectors
      1. Merge-on-Read
      2. Stepping Through Deletion Vectors
    4. Conclusion
  12. 9. Architecting Your Lakehouse
    1. The Lakehouse Architecture
      1. What Is a Lakehouse?
      2. Learning from Data Warehouses
      3. Learning from Data Lakes
      4. The Dual-Tier Data Architecture
      5. Lakehouse Architecture
    2. Foundations with Delta Lake
      1. Open Source on Open Standards in an Open Ecosystem
      2. Transaction Support
      3. Schema Enforcement and Governance
    3. The Medallion Architecture
      1. Exploring the Bronze Layer
      2. Exploring the Silver Layer
      3. Exploring the Gold Layer
    4. Streaming Medallion Architecture
    5. Conclusion
  13. 10. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake
    1. Performance Objectives
      1. Maximizing Read Performance
      2. Maximizing Write Performance
    2. Performance Considerations
      1. Partitioning
      2. Table Utilities
      3. Table Statistics
      4. Cluster By
      5. Bloom Filter Index
    3. Conclusion
  14. 11. Successful Design Patterns
    1. Slashing Compute Costs
      1. High-Speed Solutions
      2. Smart Device Integration
    2. Efficient Streaming Ingestion
      1. Streaming Ingestion
      2. The Inception of Delta Rust
      3. The Evolution of Ingestion
    3. Coordinating Complex Systems
      1. Combining Operational Data Stores at DoorDash
      2. Change Data Capture
      3. Delta and Flink in Harmony
    4. Conclusion
  15. 12. Foundations of Lakehouse Governance and Security
    1. Lakehouse Governance
    2. The Emergence of Data Governance
      1. Data Products and Their Relationship to Data Assets
      2. Data Products in the Lakehouse
      3. Maintaining High Trust
    3. Data Assets and Access
    4. The Data Asset Model
    5. Unifying Governance Between Data Warehouses and Lakes
      1. Permissions Management
      2. Filesystem Permissions
      3. Cloud Object Store Access Controls
      4. Identity and Access Management
      5. Data Security
      6. Fine-Grained Access Controls for the Lakehouse
    6. Conclusion
  16. 13. Metadata Management, Data Flow, and Lineage
    1. Metadata Management
      1. What Is Metadata Management?
      2. Data Catalogs
      3. Data Reliability, Stewards, and Permissions Management
      4. Why the Metastore Matters
      5. Unity Catalog
    2. Data Flow and Lineage
      1. Data Lineage
      2. Data Sharing
      3. Automating Data Life Cycles
      4. Audit Logging
      5. Monitoring and Alerting
      6. What Is Data Discovery?
    3. Conclusion
  17. 14. Data Sharing with the Delta Sharing Protocol
    1. The Basics of Delta Sharing
      1. Data Providers
      2. Data Recipients
    2. Delta Sharing Server
      1. Using the REST APIs
      2. Anatomy of the REST URI
      3. List Shares
      4. Get Share
      5. List Schemas in Share
      6. List All Tables in Share
    3. Delta Sharing Clients
      1. Delta Sharing with Apache Spark
      2. Stream Processing with Delta Shares
      3. Delta Sharing Community Connectors
    4. Conclusion
  18. Index
  19. About the Authors

Product information

  • Title: Delta Lake: The Definitive Guide
  • Author(s): Denny Lee, Tristen Wentling, Scott Haines, Prashanth Babu
  • Release date: October 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098151942