Chapter 6. Identifying Workflow and Pipeline Issues

From its beginning, this book has been about extracting value from your data. That journey to value sometimes requires your data to flow through many different systems and processing stages. Chapter 5 focused on failures within a single process. This chapter steps back and looks at the bigger picture, asking how to identify issues at the pipeline level.

Typically, to extract value from data, you need a number of operations to happen first, tasks that include but aren’t limited to the following:

Join and enrich

Joining and enriching the data with external data sources.
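As a minimal sketch of this step, the following joins hypothetical order records against an external reference feed (the dataset names and fields are illustrative, not from the book):

```python
# Hypothetical raw orders and an external currency-rate feed.
orders = [
    {"order_id": 1, "currency": "EUR", "amount": 10.0},
    {"order_id": 2, "currency": "GBP", "amount": 20.0},
]
rates = {"EUR": 1.1, "GBP": 1.3}  # external reference data, keyed for lookup

# Left join: attach the external rate and derive a normalized USD amount.
enriched = [
    {
        **order,
        "usd_rate": rates.get(order["currency"]),
        "amount_usd": order["amount"] * rates.get(order["currency"], 0.0),
    }
    for order in orders
]
```

In a real pipeline the same join might run in Spark or SQL, but the shape is identical: look up external context per record and carry the derived columns forward.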

Optimal persist

Getting the raw data into the desired common format. This can mean storing the same data in different storage systems, each optimized for a different access pattern.
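A toy sketch of writing the same records into two stores with different access patterns; here CSV stands in for a scan-friendly format such as Parquet, and SQLite stands in for a point-lookup store (both are stand-ins chosen for illustration, not the book's recommendation):

```python
import csv
import sqlite3

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Scan-friendly copy: good for full-table analytical reads.
with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)

# Point-lookup copy: good for fetching a single record by key.
conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT OR REPLACE INTO events VALUES (:id, :name)", rows)
conn.commit()
```

The point is that neither copy is "the" data; each is a persistence of the same raw records shaped for how a consumer will read it.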

Feature generation

Enriching data with logic (hand coded or artificial intelligence generated).
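A small sketch of the hand-coded variety: derived features computed from a raw record with simple rules (the field names and thresholds are hypothetical):

```python
def generate_features(txn: dict) -> dict:
    """Return the record plus rule-based derived features."""
    return {
        **txn,
        # Simple hand-coded rule: flag transactions above a threshold.
        "is_large": txn["amount"] > 100,
        # Crude order-of-magnitude bucket: number of integer digits.
        "magnitude_bucket": len(str(int(txn["amount"]))),
    }

feats = generate_features({"user": "u1", "amount": 250.0})
```

The AI-generated variety would replace the rule bodies with model scoring, but the pipeline shape, record in, wider record out, stays the same.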

Tokenization/masking

The act of removing personal data and replacing it with universally unique identifiers (UUIDs) to better protect the person the data relates to and to protect the company from legal issues.
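A minimal tokenizer sketch along these lines: each distinct personal value is swapped for a UUID, with the mapping held aside so authorized systems can still resolve it (the mapping store and field names here are illustrative):

```python
import uuid

# In practice this mapping would live in a secured vault, not in memory.
_token_map: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a PII value with a stable UUID token."""
    if value not in _token_map:
        _token_map[value] = str(uuid.uuid4())
    return _token_map[value]

record = {"email": "alice@example.com", "amount": 42}
masked = {**record, "email": tokenize(record["email"])}
```

Because the same input always yields the same token, joins and aggregations on the tokenized column still work while the raw personal data stays out of downstream systems.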

Data marting

The action of joining, ordering, or filtering datasets to get them to a more consumable state.
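A toy mart build in that spirit: raw sales records aggregated to the grain consumers actually query and ordered for presentation (the dataset and grouping key are hypothetical):

```python
from collections import defaultdict

sales = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 30},
    {"region": "EU", "amount": 5},
]

# Aggregate to a consumable grain: total per region, largest first.
totals: dict[str, float] = defaultdict(float)
for sale in sales:
    totals[sale["region"]] += sale["amount"]

mart = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The output is deliberately smaller and simpler than the input: a mart trades generality for being directly answerable by the people who consume it.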

To make things more difficult, the larger an organization is, the more separated these operations will be. Compound that with the fact that you will most likely have hundreds, if not thousands, of datasets with thousands of value-producing approaches and outcomes. That will leave you with many ...
