Chapter 32. Focus on Maintainability and Break Up Those ETL Tasks

Chris Moradi

As the data science tent widens, many practitioners excel at using prepared data but lack the skills to do that preparation reliably themselves. These responsibilities can be split across multiple roles and teams, but enormous productivity gains can be achieved by taking a full-stack approach, in which data scientists own the entire process from ideation through deployment.

Whether you’re a data scientist building your own ETLs or a data engineer assisting data scientists in this process, making your data pipelines easier to understand, debug, and extend will reduce the support burden for yourself and your teammates. This will facilitate iteration and innovation in the future.

The primary way to make ETLs more maintainable is to follow basic software engineering best practices and break the processing into small and easy-to-understand tasks that can be strung together—preferably with a workflow engine. Small ETL tasks are easier for new contributors and maintainers to understand, they’re easier to debug, and they allow for greater code reuse.
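As a minimal sketch of what this looks like in practice, consider a daily pipeline that currently extracts, cleans, and loads user activity in one monolithic script. The example below splits it into three small, single-purpose tasks chained in a workflow engine; it assumes a recent Apache Airflow release and its TaskFlow API, and the task names, bucket paths, and target table are hypothetical placeholders rather than anything prescribed by this chapter.

```python
# A sketch of one daily pipeline broken into small, single-purpose tasks,
# assuming Apache Airflow 2.4+ and its TaskFlow API. Paths and names are
# illustrative placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def user_activity_etl():
    @task
    def extract_events(ds=None):
        # Pull one day of raw events; return only the path, not the data itself.
        return f"s3://raw-bucket/events/{ds}.json"

    @task
    def clean_events(raw_path: str):
        # Parse, deduplicate, and validate; write the result to its own location.
        return raw_path.replace("raw-bucket", "clean-bucket")

    @task
    def load_to_warehouse(clean_path: str):
        # Placeholder for the real load into the analytics warehouse.
        print(f"COPY user_activity FROM {clean_path}")

    # Chaining the calls lets the workflow engine infer the dependencies.
    load_to_warehouse(clean_events(extract_events()))


user_activity_etl()
```

Each task can now be rerun, tested, and debugged on its own, and the cleaning step can be reused by any other pipeline that needs the same events.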

Doing too much in a processing step is a common pitfall for both the inexperienced and the highly experienced. With less experience, it can be hard to know how to decompose a large workflow into small, well-defined transformations. If you’re relatively new to building ...
