Delta Lake: The Definitive Guide

Book description

Discover how Delta Lake simplifies the process of building data lakehouses and data pipelines at scale. With this practical guide, data engineers, data scientists, and data analysts will explore key data reliability challenges and learn to apply modern data engineering and management techniques. You'll also understand how ACID transactions bring reliability to data lakehouses at scale.

This book helps you:

  • Understand key data reliability challenges
  • Examine data management and engineering techniques using the modern data stack
  • Realize data reliability improvements using Delta Lake
  • Concurrently run streaming and batch jobs against your data lake
  • Execute update, delete, and merge commands
  • Use time travel to roll back and examine previous versions of your data
  • Build a streaming data quality pipeline following the medallion architecture

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Installing Delta Lake
    1. Delta Lake Docker Image
      1. Choose an Interface
    2. Native Delta Lake Libraries
      1. Various bindings available
      2. Installation
    3. Apache Spark with Delta Lake
      1. Setting up Delta Lake with Apache Spark
      2. Prerequisite: set up Java
      3. Set up an interactive shell
    4. PySpark Declarative API
    5. Databricks Community Edition
      1. Create a Cluster with Databricks Runtime
      2. Importing notebooks
      3. Attaching Notebooks
    6. Summary
  3. 2. Diving into the Delta Lake Ecosystem
    1. Connectors
    2. Apache Flink
      1. Flink DataStream Connector
      2. Installing the Connector
      3. DeltaSource API
      4. DeltaSink API
      5. End-to-End Example
    3. Kafka Delta Ingest
      1. Using the Connector
    4. Trino
      1. Getting Started
      2. Configuring and Using the Trino Connector
      3. Using Show Catalogs
      4. Creating a Schema
      5. Show Schemas
      6. Working with Tables
      7. Table Operations
    5. Summary
  4. 3. Maintaining Your Delta Lake
    1. Using Delta Lake Table Properties
      1. Create an Empty Table with Properties
      2. Populate the Table
      3. Evolve the Table Schema
      4. Add or Modify Table Properties
      5. Remove Table Properties
    2. Delta Table Optimization
      1. The Problem with Big Tables and Small Files
      2. Using Optimize to Fix the Small File Problem
    3. Table Tuning and Management
      1. Partitioning your Tables
      2. Defining Partitions on Table Creation
      3. Migrating from a Non-Partitioned to Partitioned Table
    4. Repairing, Restoring, and Replacing Table Data
      1. Recovering and Replacing Tables
      2. Deleting Data and Removing Partitions
      3. The Lifecycle of a Delta Lake Table
      4. Restoring your Table
      5. Cleaning Up
    5. Summary
  5. 4. Streaming In and Out of Your Delta Lake
    1. Streaming and Delta Lake
      1. Streaming vs Batch Processing
      2. Delta as Source
      3. Delta as Sink
    2. Delta streaming options
      1. Limit the Input Rate
      2. Ignore Updates or Deletes
      3. Initial Processing Position
      4. Initial Snapshot with EventTimeOrder
    3. Advanced Usage with Apache Spark
      1. Idempotent Stream Writes
      2. Delta Lake Performance Metrics
    4. Auto Loader and Delta Live Tables
      1. Auto Loader
      2. Delta Live Tables
    5. Change Data Feed
      1. Using Change Data Feed
      2. Schema
    6. Additional Thoughts
    7. Key References
  6. 5. Architecting Your Lakehouse
    1. The Lakehouse Architecture
      1. What is a Lakehouse?
      2. Learning from Data Warehouses
      3. Learning from Data Lakes
      4. The Dual-Tier Data Architecture
      5. Lakehouse Architecture
    2. Foundations with Delta Lake
      1. Open-Source on Open-Standards in an Open Ecosystem
      2. Transaction Support
      3. Schema Enforcement and Governance
    3. The Medallion Architecture
      1. Exploring the Bronze Layer
      2. Exploring the Silver Layer
      3. Exploring the Gold Layer
    4. Streaming Medallion Architecture
      1. Reducing End-to-End Latency within your Lakehouse
    5. Summary
  7. 6. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake
    1. Performance Objectives
      1. Maximizing read performance
      2. Maximizing write performance
    2. Performance Considerations
      1. Partitioning
      2. Table Utilities
      3. Table Statistics
      4. Cluster By
      5. Bloom Filter Index
    3. Conclusion
  8. 7. Successful Design Patterns
    1. Slashing Compute Costs
      1. High-Speed Solutions
      2. Smart Device Integration
    2. Efficient Streaming Ingestion
      1. Streaming Ingestion
      2. The Inception of Delta Rust
    3. Coordinating Complex Systems
      1. Combining Operational Data Stores at DoorDash
    4. Conclusion
    5. References
      1. Comcast
      2. Scribd
      3. DoorDash
  9. 8. Lakehouse Governance & Security
    1. Lakehouse Governance
      1. The Facets of Lakehouse Governance
    2. The Emergence of Data Governance
      1. Data Products and their Relationship to Data Assets
      2. Data Products in the Lakehouse
    3. Data Assets and Access
      1. The Data Asset Model
    4. Unifying Governance between Data Warehouses and Lakes
      1. Permissions Management
      2. File System Permissions
      3. Cloud Object Store Access Controls
      4. Data Security
    5. Metadata Management
      1. What is Metadata Management?
      2. Data Catalogs
    6. Data Flow and Lineage
      1. Data Lineage
      2. Data Sharing
      3. Automating Data Lifecycles
      4. Audit Logging
      5. Monitoring and Alerting
      6. What is Data Discovery?
    7. Summary

Product information

  • Title: Delta Lake: The Definitive Guide
  • Author(s): Denny Lee, Prashanth Babu, Tristen Wentling, Scott Haines
  • Release date: November 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098151942