Chapter 17. Data Engineering for Autonomy and Rapid Innovation

Jeff Magnusson

In many organizations, data engineering is treated purely as a specialty. Data pipelines are seen as the complex, arcane domain of data engineers, who are often organized into dedicated teams or embedded in vertically oriented, product-based teams.

While delegating work to specialists often makes sense, it also implies that a handoff is required to accomplish anything that extends beyond that specialty. Fortunately, with the right frameworks and infrastructure in place, handoffs are unnecessary for accomplishing (and, perhaps more importantly, iterating on!) many data flows and tasks.

Data pipelines can generally be decomposed into business or algorithmic logic (metric computation, model training, featurization, etc.) and data-flow logic (complex joins, data wrangling, sessionization, etc.). Data engineers specialize in implementing the data-flow logic, but they often must implement the other logic to spec, driven by the requirements of the team requesting the work, without the autonomy to adjust those requirements.

This happens because the two types of logic are typically intertwined and implemented hand in hand throughout the pipeline. Instead, look for ways to decouple data-flow logic from the other forms of logic within the pipeline, as in the sketch below, and consider the strategies that follow it.
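To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is illustrative rather than taken from this chapter: the Event schema, the 30-minute session gap, and the metric definitions are all hypothetical. The sessionize routine is pure data-flow logic; the metrics dictionary is business logic supplied by the requesting team.

    from dataclasses import dataclass
    from typing import Callable, Iterable

    # Hypothetical event record; field names are illustrative only.
    @dataclass
    class Event:
        user_id: str
        timestamp: float  # seconds since epoch
        page: str

    Session = list[Event]

    def sessionize(events: Iterable[Event], gap_seconds: float = 1800.0) -> list[Session]:
        """Data-flow logic: group each user's events into sessions separated
        by more than gap_seconds of inactivity. Knows nothing about metrics."""
        ordered = sorted(events, key=lambda e: (e.user_id, e.timestamp))
        sessions: list[Session] = []
        for event in ordered:
            if (sessions
                    and sessions[-1][-1].user_id == event.user_id
                    and event.timestamp - sessions[-1][-1].timestamp <= gap_seconds):
                sessions[-1].append(event)  # continue the current session
            else:
                sessions.append([event])    # start a new session
        return sessions

    def compute_session_metrics(
        events: Iterable[Event],
        metric_fns: dict[str, Callable[[Session], float]],
    ) -> list[dict[str, float]]:
        """The seam between the two concerns: data-flow logic runs here,
        while business logic is injected by the requesting team."""
        return [
            {name: fn(session) for name, fn in metric_fns.items()}
            for session in sessionize(events)
        ]

    # Business logic: owned and iterated on by the requesting team, with
    # no knowledge of how sessionization is implemented.
    metrics = {
        "duration_seconds": lambda s: s[-1].timestamp - s[0].timestamp,
        "page_views": lambda s: float(len(s)),
    }

    events = [
        Event("u1", 0.0, "home"),
        Event("u1", 600.0, "search"),
        Event("u1", 10_000.0, "home"),  # gap > 1800s, so a new session
    ]
    print(compute_session_metrics(events, metrics))
    # [{'duration_seconds': 600.0, 'page_views': 2.0},
    #  {'duration_seconds': 0.0, 'page_views': 1.0}]

The value of the seam is that the requesting team can add or change metric definitions without touching (or even understanding) the sessionization internals, while data engineers can optimize the data-flow logic without renegotiating requirements; neither change requires a handoff.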

Implement Reusable Patterns ...
