Chapter 2. Data Transformation

While data ingestion simply transfers data from point A to point B, data transformation turns raw data into valuable insights at various stages of the data lifecycle. This chapter delves into the diverse languages, platforms, and technologies available to data practitioners for executing data transformations.

We’ll see how to ensure that data transformations are conducted efficiently and in a well-coordinated manner, laying the groundwork for more detailed discussions on efficiency, scalability, and observability later in the guide.

What Is Data Transformation?

Data transformation is the art of manipulating and enhancing data to better serve users and processes. It involves taking data, whether raw or nearly pristine, and performing one or more operations to move it closer to its intended use. In an ETL pipeline, transformation occurs in not one but many places: data might be transformed upon ingestion and again at any number of points downstream. The goal of data transformation is to turn data into an asset, using analysis and science to create something of value for the business.

Transformation might be as simple as filtering out unwanted records or as complex as restructuring the source data entirely. Transformation exists on a spectrum; there’s an almost infinite number of ways to transform data, and that’s what keeps things interesting!
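To make the two ends of that spectrum concrete, here is a minimal sketch in plain Python. The records, field names, and the helper functions filter_completed and total_by_region are hypothetical, invented purely for illustration; a real pipeline would more likely use SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical raw records, invented for this illustration.
raw_orders = [
    {"order_id": 1, "region": "east", "status": "complete", "amount": 120.0},
    {"order_id": 2, "region": "west", "status": "cancelled", "amount": 75.0},
    {"order_id": 3, "region": "east", "status": "complete", "amount": 30.0},
    {"order_id": 4, "region": "west", "status": "complete", "amount": 50.0},
]


def filter_completed(orders):
    """Simple end of the spectrum: drop unwanted records."""
    return [order for order in orders if order["status"] == "complete"]


def total_by_region(orders):
    """More involved: restructure row-level records into one total per region."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["amount"]
    return dict(totals)


completed = filter_completed(raw_orders)
revenue_by_region = total_by_region(completed)
print(revenue_by_region)
```

The filter leaves the shape of each record untouched and only reduces the row count, while the aggregation changes the grain of the data from one row per order to one value per region. Chaining the two also mirrors the idea that data may be transformed at several points on its way downstream.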

Similarly, transformation can be orchestrated in any language with any ...
