Chapter 6. Transforming Data

In the ELT pattern defined in Chapter 3, once data has been ingested into a data lake or data warehouse (Chapter 4), the next step in a pipeline is data transformation. Data transformation can include both noncontextual manipulation of data and modeling of data with business context and logic in mind.

If the purpose of the pipeline is to produce business insight or analysis, then in addition to any noncontextual transformations, data is further transformed into data models. Recall from Chapter 2 that a data model structures and defines data in a format that is understood and optimized for data analysis. A data model is represented as one or more tables in a data warehouse.

Though data engineers at times build noncontextual transformation in a pipeline, it’s become typical for data analysts and analytics engineers to handle the vast majority of data transformations. People in these roles are more empowered than ever thanks to the emergence of the ELT pattern (they have the data they need right in the warehouse!) and supporting tools and frameworks designed with SQL as their primary language.

This chapter explores both noncontextual transformations that are common to nearly every data pipeline as well as data models that power dashboards, reports, and one-time analysis of a business problem. Because SQL is the language of the data analyst and analytics engineer, most transformation code samples are written in SQL. I include a few samples written in Python ...

Get Data Pipelines Pocket Reference now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.