Simplifying Data Engineering and Analytics with Delta

Book description

Explore how Delta brings reliability, performance, and governance to your data lake and all the AI and BI use cases built on top of it

Key Features

  • Learn Delta's core concepts and features as well as what makes it a perfect match for data engineering and analysis
  • Solve business challenges of different industry verticals using a scenario-based approach
  • Make optimal choices by understanding the various tradeoffs provided by Delta

Book Description

Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus primarily on refining the use cases you are working on. This is especially important because existing architecture is frequently reused for new use cases.

In this book, you'll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You'll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you'll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.
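
To give a flavor of these features in code, the following is a minimal PySpark sketch, assuming a local Spark session with the open source delta-spark package configured; the table path and columns are hypothetical illustrations, not examples taken from the book:

    # A minimal sketch, assuming delta-spark is installed; the path and
    # column names below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta/events"  # hypothetical table location

    # ACID batch append; mergeSchema allows disciplined schema evolution.
    df = spark.createDataFrame([(1, "click")], ["id", "action"])
    df.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save(path)

    # Time travel: rewind the table to an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # The same table doubles as a streaming source, unifying batch and
    # streaming over a single copy of the data.
    stream = spark.readStream.format("delta").load(path)

Each write lands as an atomic commit in the table's transaction log, which is what makes both the time travel and the streaming reads possible.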

By the end of this Delta book, you'll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.

What you will learn

  • Explore the key challenges of traditional data lakes
  • Appreciate the unique features of Delta that come out of the box
  • Address reliability, performance, and governance concerns using Delta
  • Examine how Delta's open data format enables an extensible and pluggable architecture
  • Handle multiple use cases to support BI, AI, streaming, and data discovery
  • Discover how common data and machine learning design patterns are executed on Delta
  • Build and deploy data and machine learning pipelines at scale using Delta

Who this book is for

Data engineers, data scientists, ML practitioners, BI analysts, and anyone else in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.

Table of contents

  1. Simplifying Data Engineering and Analytics with Delta
  2. Foreword
  3. Contributors
  4. About the author
  5. About the reviewer
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Share Your Thoughts
  7. Section 1 – Introduction to Delta Lake and Data Engineering Principles
  8. Chapter 1: Introduction to Data Engineering
    1. The motivation behind data engineering
      1. Use cases
      2. How big is big data?
      3. But aren't ML and AI all the rage today?
    2. Understanding the role of data personas
    3. Big data ecosystem
      1. What characterizes big data?
      2. Classifying data
      3. Reaping value from data
      4. Top challenges of big data systems
    4. Evolution of data systems
      1. Rise of cloud data platforms
      2. SQL and NoSQL systems
      3. OLTP and OLAP systems
        1. Data platform service models
    5. Distributed computing
      1. SMP and MPP computing
      2. Parallel and distributed computing
        1. Hadoop
        2. Spark
        3. Hadoop versus Spark
    6. Business justification for tech spending
      1. Strategy for business transformation to use data as an asset
      2. Big data trends and best practices
    7. Summary
  9. Chapter 2: Data Modeling and ETL
    1. Technical requirements
    2. What is data modeling and why should you care?
      1. Advantages of a data modeling exercise
      2. Stages of data modeling
      3. Data modeling approaches for different data stores
        1. Relational data modeling
        2. Non-relational data modeling
    3. Understanding metadata – data about data
      1. Data catalog
      2. Types of metadata
      3. Why is metadata management the nerve center of data?
    4. Moving and transforming data using ETL
      1. Scenarios to consider for building ETL pipelines
        1. Periodic and continuous ingestion
        2. Bulk data migration
        3. Change data capture
        4. Slowly changing dimensions
      2. Job orchestration
    5. How to choose the right data format
      1. Text format versus binary format
      2. Row versus column formats
      3. When to use which format
      4. Leveraging data compression
    6. Common big data design patterns
      1. Ingestion
        1. Unified API
        2. Speed layer
      2. Transformations
        1. Handling schema changes
        2. ACID transactions
        3. Multihop pipeline
      3. Persist
        1. Separation of compute from storage
        2. Multiple destinations
        3. Denormalization
        4. In-stream analytics
        5. Best practices
    7. Summary
    8. Further reading
  10. Chapter 3: Delta – The Foundation Block for Big Data
    1. Technical requirements
    2. Motivation for Delta
      1. A case of too many is too little
      2. Data silos to data swamps
      3. Characteristics of curated data lakes
      4. DDL commands
        1. CREATE
      5. DML commands
        1. APPEND
        2. UPDATE
        3. DELETE
        4. MERGE
    3. Demystifying Delta
      1. Format layout on disk
    4. The main features of Delta
      1. ACID transaction support
      2. Schema evolution
      3. Unifying batch and streaming workloads
      4. Time travel
      5. Performance
        1. Data skipping
        2. Z-Order clustering
        3. Delta cache
    5. Life with and without Delta
      1. Lakehouse
        1. Characteristics of a Lakehouse
    6. Summary
  11. Section 2 – End-to-End Process of Building Delta Pipelines
  12. Chapter 4: Unifying Batch and Streaming with Delta
    1. Technical requirements
    2. Moving toward real-time systems
      1. Streaming concepts
      2. Lambda versus Kappa architectures
    3. Streaming ETL
      1. Extract – file-based versus event-based streaming
      2. Transforming – stream processing
      3. Loading – persisting the stream
    4. Handling streaming scenarios
      1. Joining with other static and dynamic datasets
      2. Recovering from failures
      3. Handling late-arriving data
      4. Stateless and stateful in-stream operations
    5. Trade-offs in designing streaming architectures
      1. Cost trade-offs
      2. Handling latency trade-offs
      3. Data reprocessing
      4. Multi-tenancy
      5. De-duplication
    6. Streaming best practices
    7. Summary
  13. Chapter 5: Data Consolidation in Delta Lake
    1. Technical requirements
    2. Why consolidate disparate data types?
    3. Delta unifies all types of data
      1. Structured data
      2. Semi-structured data
      3. Unstructured data
    4. Avoiding patches of data darkness
      1. Addressing problems in flight status using Delta
      2. Augmenting domain knowledge constraints to quality
      3. Continuous quality monitoring
    5. Curating data in stages for analytics
      1. RDD, DataFrames, and datasets
      2. Spark transformations and actions
      3. Spark APIs and UDFs
    6. Ease of extending to existing and new use cases
      1. Delta Lake connectors
      2. Specialized Delta Lakes by industry
        1. Healthcare and life sciences Delta Lake
        2. Industry 4.0 manufacturing Delta Lake
        3. Financial services Delta Lake
        4. Retail Delta Lake
    7. Data governance
      1. GDPR and CCPA compliance
      2. Role-based data access
    8. Summary
  14. Chapter 6: Solving Common Data Pattern Scenarios with Delta
    1. Technical requirements
    2. Understanding use case requirements
    3. Minimizing data movement with Delta time travel
    4. Delta cloning
    5. Handling CDC
      1. CDC
      2. Change Data Feed (CDF)
    6. Handling Slowly Changing Dimensions (SCD)
      1. SCD Type 1
      2. SCD Type 2
    7. Summary
  15. Chapter 7: Delta for Data Warehouse Use Cases
    1. Technical requirements
    2. Choosing the right architecture
    3. Understanding what a data warehouse really solves
      1. Lacunas of data warehouses
    4. Discovering when a data lake does not suffice
    5. Addressing concurrency and latency requirements with Delta
    6. Visualizing data using BI reporting
      1. Can cubes be constructed with Delta?
    7. Analyzing tradeoffs in a push versus pull data flow
      1. Why is being open such a big deal?
    8. Considerations around data governance
    9. The rise of the lakehouse category
    10. Summary
  16. Chapter 8: Handling Atypical Data Scenarios with Delta
    1. Technical requirements
    2. Emphasizing the importance of exploratory data analysis (EDA)
      1. From big data to good data
      2. Data profiling
      3. Statistical analysis
    3. Applying sampling techniques to address class imbalance
      1. How to detect and address imbalance
      2. Synthetic data generation to deal with data imbalance
    4. Addressing data skew
    5. Providing data anonymity
    6. Handling bias and variance in data
      1. Bias versus variance
      2. How do we detect bias and variance?
      3. How do we fix bias and variance?
    7. Compensating for missing and out-of-range data
    8. Monitoring data drift
    9. Summary
  17. Chapter 9: Delta for Reproducible Machine Learning Pipelines
    1. Technical requirements
    2. Data science versus machine learning
    3. Challenges of ML development
    4. Formalizing the ML development process
      1. What is a model?
      2. What is MLOps?
      3. Aspirations of a modern ML platform
    5. The role of Delta in an ML pipeline
      1. Delta-backed feature store
      2. Delta-backed model training
      3. Delta-backed model inferencing
      4. Model monitoring with Delta
    6. From business problem to insight generation
    7. Summary
  18. Chapter 10: Delta for Data Products and Services
    1. Technical requirements
    2. DaaS
    3. The need for data democratization
    4. Delta for unstructured data
      1. NLP data (text and audio)
      2. Image and video data
    5. Data mashups using Delta
      1. Data blending
      2. Data harmonization
      3. Federated query
    6. Facilitating data sharing with Delta
      1. Setting up Delta sharing
      2. Benefits of Delta sharing
      3. Data clean room
    7. Summary
  19. Section 3 – Operationalizing and Productionalizing Delta Pipelines
  20. Chapter 11: Operationalizing Data and ML Pipelines
    1. Technical requirements
    2. Why operationalize?
    3. Understanding and monitoring SLAs
    4. Scaling and high availability
    5. Planning for DR
      1. How to decide on the correct DR strategy
      2. How Delta helps with DR
    6. Guaranteeing data quality
    7. Automation of CI/CD pipelines
      1. Code under version control
      2. Infrastructure as Code (IaC)
      3. Unit and integration testing
    8. Data as code – An intelligent pipeline
    9. Summary
  21. Chapter 12: Optimizing Cost and Performance with Delta
    1. Technical requirements
    2. Improving performance with common strategies
      1. Where to look and what to look for
    3. Optimizing with Delta
      1. Changing the data layout in storage
      2. Other platform optimizations
      3. Automation
    4. Is cost always inversely proportional to performance?
    5. Best practices for managing performance
    6. Summary
  22. Chapter 13: Managing Your Data Journey
    1. Provisioning a multi-tenant infrastructure
    2. Data democratization via policies and processes
    3. Capacity planning
    4. Managing and monitoring
    5. Data sharing
    6. Data migration
    7. COE best practices
    8. Summary
    9. Why subscribe?
  23. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Simplifying Data Engineering and Analytics with Delta
  • Author(s): Anindita Mahapatra, Doug May
  • Release date: July 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781801814867