Chapter 37. How Data Pipelines Evolve
Chris Heinzmann
In today’s world, there is so much data being generated and so much business value waiting to be discovered. How can a data engineer get that data efficiently into the hands of analysts and data scientists?
Enter the data pipeline. Historically, the standard business practice was to set up an extract, transform, load (ETL) pipeline (a minimal code sketch follows the list below):
- Extract
- Take data from a source system, usually by way of a scheduler that executes code, referred to as jobs.
- Transform
- Modify the data in some way—for example, ensure consistency in naming, provide accurate timestamps, perform basic data cleansing, or calculate baseline metrics.
- Load
- Save the data to a target system, usually a data warehouse.
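To make the pattern concrete, here is a minimal sketch of an ETL job in Python. The source API, table schema, and field names are all hypothetical, and it assumes the requests library is available; a real pipeline would run this as a scheduled job and load into a proper data warehouse rather than a local SQLite file.

```python
import sqlite3
from datetime import datetime, timezone

import requests  # assumed available for the extract step


# --- Extract: pull raw records from a (hypothetical) source API ---
def extract(api_url: str) -> list[dict]:
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()


# --- Transform: enforce naming, timestamps, and basic cleansing ---
def transform(records: list[dict]) -> list[tuple]:
    cleaned = []
    for record in records:
        cleaned.append((
            record["id"],
            record.get("customer_name", "").strip().lower(),  # consistent naming
            float(record.get("amount", 0.0)),                  # basic cleansing
            datetime.now(timezone.utc).isoformat(),            # load timestamp
        ))
    return cleaned


# --- Load: save to a target system (SQLite standing in for a warehouse) ---
def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS orders (
                   id INTEGER PRIMARY KEY,
                   customer_name TEXT,
                   amount REAL,
                   loaded_at TEXT
               )"""
        )
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)


if __name__ == "__main__":
    # The URL and schema here are illustrative, not from the chapter.
    load(transform(extract("https://example.com/api/orders")))
```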
The ETL pattern worked well for many years, and continues to work for thousands of companies. If it’s not broken, don’t fix it. However, traditional ETL can also be intimidating to get started with, and alternatives exist.
An early-stage business still navigating product/market fit should forgo the sophisticated pipeline: the questions will be too varied, and the answers needed too quickly, to justify the investment. All that is required is a set of SQL scripts that run as a cron job against the production database during a low-traffic period, plus a spreadsheet (see the sketch below).
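As a rough illustration of that lightweight approach, the sketch below runs a saved SQL query against the production database and writes the result to a CSV that can be opened in a spreadsheet. The schedule, file paths, and database are placeholders (SQLite keeps the example self-contained); a real setup would use the production database's own driver and a read replica if one exists.

```python
"""Nightly report: run read-only SQL against production and dump to CSV.

Scheduled via cron during a low-traffic window, e.g.:
    0 3 * * * python3 /opt/reports/nightly_report.py
(The schedule, paths, and query file are hypothetical.)
"""
import csv
import os
import sqlite3  # stand-in; swap for the production database's driver

QUERY_FILE = "/opt/reports/daily_metrics.sql"   # hypothetical report query
OUTPUT_CSV = "/opt/reports/daily_metrics.csv"   # opened later in a spreadsheet


def main() -> None:
    with open(QUERY_FILE) as f:
        query = f.read()

    # Connect to the production database (SQLite here keeps the sketch runnable).
    with sqlite3.connect(os.environ.get("PROD_DB_PATH", "prod.db")) as conn:
        cursor = conn.execute(query)
        headers = [col[0] for col in cursor.description]
        with open(OUTPUT_CSV, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(headers)
            writer.writerows(cursor)


if __name__ == "__main__":
    main()
```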
For a company in the middle of a growth stage, setting up an extract, load, transform (ELT) pipeline is appropriate. You will have plenty of unknowns and want to remain as agile ...