Chapter 8. Data Validation in Pipelines
Even in the best-designed data pipeline, something is bound to go wrong. Many issues can be avoided, or at least mitigated, with good design of processes, orchestration, and infrastructure. To ensure the quality and validity of the data itself, however, you’ll need to invest in data validation. It’s best to assume that untested data is not safe to use in analytics. This chapter discusses the principles of data validation throughout the steps of an ELT pipeline.
Validate Early, Validate Often
Though well intentioned, some data teams leave data validation to the end of the pipeline, implementing checks during transformation or even after all transformations are complete. The reasoning behind this design is that the data analysts (who typically own the transform logic) are best suited to make sense of the data and determine whether there are any quality issues.
In such a design, the data engineers focus on moving data from one system to another, orchestrating pipelines, and maintaining the data infrastructure. Although that’s the role of a data engineer, there’s one thing missing: by ignoring the content of the data flowing through each step in the pipeline, they are placing their trust in the owners of the source systems they ingest from, their own ingestion processes, and the analysts who transform the data. As efficient as such a separation of responsibilities sounds, it’s likely to end with low data quality and an inefficient ...
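To make the "validate early" idea concrete, the following is a minimal sketch of a check a data engineer might run immediately after an ingestion step rather than waiting for the transform stage. The table name, expected threshold, and use of sqlite3 as the database client are assumptions for illustration; in practice you’d point this at whatever warehouse and ingested table your pipeline loads.

import sqlite3  # stand-in for your warehouse's database client


def validate_row_count(conn, table_name, min_rows):
    """Fail fast if an ingested table has fewer rows than expected."""
    cursor = conn.cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
    row_count = cursor.fetchone()[0]
    if row_count < min_rows:
        # Raising here stops the pipeline before bad data reaches the
        # transform step and the analysts downstream.
        raise ValueError(
            f"Validation failed: {table_name} has {row_count} rows, "
            f"expected at least {min_rows}"
        )
    return row_count


if __name__ == "__main__":
    # Hypothetical local database and table used only for illustration.
    conn = sqlite3.connect("pipeline.db")
    validate_row_count(conn, "orders_ingest", min_rows=1)

A check this simple catches an empty or partial load at the point where it happens, which is far easier to diagnose than discovering the gap in a report built on the transformed data.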