Testing Data Pipelines with Data Validation
Published by O'Reilly Media, Inc.
Safeguard data pipelines from unexpected failures
Data validation is the process of checking whether data follows the requirements needed for data pipelines to run reliably. It's used by data scientists and data engineers to preserve the integrity of existing workflows, thereby preventing application failures and costly mistakes. As more companies employ interconnected data pipelines, data validation will continue to grow in importance.
Join experts Han Wang and Kevin Kho to discover how to use the data validation process to test and monitor data pipelines. You’ll examine the various frameworks available for data validation, including Great Expectations, pandera, and Fugue; how data validation differs between a single-machine setting and a distributed computing setting; and how to apply validations to different partitions of data when working at large scale.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- What data validation is and why it’s important
- Various frameworks for data validation and their trade-offs
- The difference between a map and an aggregation, and how it affects validations
- The difficulties of validation in distributed computing environments
And you’ll be able to:
- Frame data validation problems in a clearer way
- Construct a data validation pipeline that catches failures
- Scale the validation pipeline to operate on Spark or Dask
- Create custom validations when built-in ones won’t suffice
- Perform different validations on logical groupings of data simultaneously (validation by partition)
This live event is for you because...
- You need to safeguard data pipelines from malformed data.
- You’re writing a validation function from scratch and want to expedite the process.
- You work with large or growing datasets and need to scale validations.
- You have heterogeneous data (think geographical differences) and want to apply different validations for each segment.
- You want to become a data scientist, data engineer, or machine learning engineer.
Prerequisites
- Familiarity with Python
- An understanding of pandas (manipulating DataFrames, groupby-aggregate semantics, operations such as min/max/median, and data types)
- Familiarity with Spark DataFrames, partitions, maps, and schemas (useful but not required)
Recommended preparation:
- No preparation or local installation needed—all exercises will be provided using Jupyter notebooks
- Watch Python Fundamentals (video)
- Read Python for Data Analysis, second edition (book)
Recommended follow-up:
- Explore the documentation for Fugue, pandera, and Great Expectations
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Course overview (10 minutes)
- Presentation: Jupyter Notebook overview; introduction to course use case
The need for data validation (15 minutes)
- Presentation: How production data pipelines can break, and the consequences thereof; the need for data validation; common data validations
- Q&A
Great Expectations (30 minutes)
- Presentation: Why use a data validation framework?; built-in validations; Great Expectations features (data validation documentation, connecting to data sources, CLI); using Great Expectations on Spark; associated pain points (a brief code sketch follows this section)
- Jupyter notebook: Show Great Expectations Performance on Spark
- Q&A
- Break
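As a rough illustration of the built-in validations covered in this section, the sketch below wraps a pandas DataFrame with Great Expectations and runs two expectations. It assumes the older `from_pandas` convenience API and invented column names; exact method names and result shapes vary by Great Expectations version.

```python
import pandas as pd
import great_expectations as ge

# hypothetical example data
df = pd.DataFrame({"price": [5.0, 12.5, 7.25], "city": ["NYC", "LA", "NYC"]})

# wrap the DataFrame so expectation methods become available (legacy pandas API)
gdf = ge.from_pandas(df)

# built-in expectations return a result object with a success flag
range_check = gdf.expect_column_values_to_be_between("price", min_value=0, max_value=100)
null_check = gdf.expect_column_values_to_not_be_null("city")

print(range_check.success, null_check.success)
```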
Pandera (25 minutes)
- Presentation: Pandera’s advantages; using pandera (a brief code sketch follows this section)
- Jupyter notebook: Use pandera as a Data Validation Framework
- Group discussion: Comparing pandera and Great Expectations
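For comparison, here is a minimal pandera schema applied to the same kind of hypothetical data; the column names and checks are invented for illustration.

```python
import pandas as pd
import pandera as pa

df = pd.DataFrame({"price": [5.0, 12.5, 7.25], "city": ["NYC", "LA", "NYC"]})

# declare expected column types plus value-level checks
schema = pa.DataFrameSchema({
    "price": pa.Column(float, pa.Check.ge(0)),             # no negative prices
    "city": pa.Column(str, pa.Check.isin(["NYC", "LA"])),  # known cities only
})

# validate() returns the DataFrame on success and raises SchemaError on failure
validated = schema.validate(df)
```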
Distributed computing theory and Fugue (30 minutes)
- Presentation: Fugue as an abstraction layer; introduction to distributed computing (partitioning, persisting/caching, lazy evaluation); the need for distributed computing; pain points (learning a new syntax, uneven partitions, code testability); Fugue extensions; transformers and processors; using pandas libraries on Spark (a brief code sketch follows this section)
- Q&A
- Break
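To make the abstraction-layer idea concrete, the sketch below uses Fugue's `transform()` to apply an ordinary pandas function, either locally or on Spark. The function and data are hypothetical; passing a SparkSession as the engine is one way Fugue lets the same code run distributed.

```python
import pandas as pd
from fugue import transform

df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "price": [5.0, None, 7.25]})

# a plain pandas function; Fugue applies it per partition on whatever backend is chosen
def fill_price(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["price"] = pdf["price"].fillna(pdf["price"].median())
    return pdf

# run locally on pandas ("*" keeps the input schema unchanged)
local_result = transform(df, fill_price, schema="*")

# to run the same logic on Spark (assumes pyspark is installed):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# spark_result = transform(df, fill_price, schema="*", engine=spark)
```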
Fugue applied (20 minutes)
- Jupyter notebook: Use Fugue to Use Python Code on Spark
- Q&A
Pandera on Spark (30 minutes)
- Presentation: Porting pandera to Spark with Fugue; limitations of pandas libraries on Spark (aggregations, performance, and pickling considerations for other libraries); a brief code sketch follows this section
- Jupyter notebook: Use pandera and Fugue on Spark
- Q&A
- Break
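A rough sketch of the porting idea: wrap the pandera validation in a function and let Fugue's `transform()` apply it to each pandas partition, locally or on Spark. The schema and data are made up; note that checks requiring whole-column aggregates only see one partition at a time when run this way.

```python
import pandas as pd
import pandera as pa
from fugue import transform

schema = pa.DataFrameSchema({"price": pa.Column(float, pa.Check.ge(0))})

# pandera runs against each pandas partition Fugue hands to this function
def validate(pdf: pd.DataFrame) -> pd.DataFrame:
    return schema.validate(pdf)

df = pd.DataFrame({"city": ["NYC", "LA"], "price": [5.0, 7.25]})

# the engine defaults to pandas here; passing a SparkSession would distribute the same validation
result = transform(df, validate, schema="*")
```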
Validation by partition (20 minutes)
- Presentation: Heterogeneous data and the need for different validations; why stand-alone validation frameworks can’t achieve this; combining Fugue and pandera for validation by partition (a brief code sketch follows this section)
- Jupyter notebook: Put It All Together—Validation by Partition
- Q&A
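A minimal sketch of validation by partition under hypothetical rules (a different acceptable price range per city): partition by the grouping column, then pick the matching pandera schema inside the function.

```python
import pandas as pd
import pandera as pa
from fugue import transform

# hypothetical per-city rules
schemas = {
    "NYC": pa.DataFrameSchema({"price": pa.Column(float, pa.Check.in_range(0, 20))}),
    "LA": pa.DataFrameSchema({"price": pa.Column(float, pa.Check.in_range(0, 10))}),
}

# each partition contains a single city, so validate it with that city's schema
def validate_by_city(pdf: pd.DataFrame) -> pd.DataFrame:
    return schemas[pdf["city"].iloc[0]].validate(pdf)

df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "price": [5.0, 12.5, 7.25]})

# partition by city; passing engine=spark (a SparkSession) would run the same logic distributed
result = transform(df, validate_by_city, schema="*", partition={"by": "city"})
```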
Your Instructors
Han Wang
Han Wang is the tech lead of Lyft’s Machine Learning Platform, focusing on distributed computing and machine learning solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon, and Quantlab. Han is the founder of the Fugue project, which aims to democratize distributed computing and machine learning.
Kevin Kho
Kevin Kho is an Open Source Community Engineer at Prefect, an open-source workflow orchestration system. Previously, he was a data scientist at Paylocity, where he worked on adding machine learning features to their Human Capital Management (HCM) Suite. Outside of work, he is a contributor to Fugue, which provides one of the SQL interfaces for Dask. He is also an organizer of the Orlando Machine Learning and Data Science Meetup.