Chapter 6. Software Development Strategies

One of the foundational concepts in The Pragmatic Programmer by David Thomas and Andrew Hunt (Addison-Wesley) is that code should be Easy To Change (ETC), a concept I’ll expand on throughout this chapter. I find this advice to be especially relevant when working with data pipelines, where change is a way of life. When developing data pipelines, you need to support changes in a multitude of areas: data size, format, and shape; data acquisition and storage; and evolving needs for data transformation and validation, not to mention changes in cloud services, providers, and data processing engines.

With all these vectors for change, even the best-intentioned codebases can turn into spaghetti, making them difficult to modify, extend, and test. This in turn hurts performance, reliability, and cost, as more time and resources are required to debug and evolve the pipeline.

This chapter is about helping you design codebases that will be resilient to the shifting sands of data pipeline design, with a focus on developing code that is ETC.

To start, I’ll discuss some common coding environments you encounter in data pipelines and show you how to effectively manage code in each situation based on my experience developing across all these tools.

Then I’ll show you techniques for creating modular codebases, using best practices from software engineering applied to common scenarios when working with data pipelines. To set the ...
