Chapter 27. Effective Data Engineering in the Cloud World

Dipti Borkar

The cloud has changed both the dynamics of data engineering and the behavior of data engineers. On premises, a data engineer typically deals only with databases and some parts of the Hadoop stack. In the cloud, the scope is much wider.

Data engineers suddenly need to think differently and more broadly. Instead of focusing purely on data infrastructure, you are now almost a full-stack engineer (leaving out the final end application, perhaps). Skills are increasingly needed across the broader stack: compute, containers, storage, data movement, performance, and networking. Here are some design concepts and data stack elements to keep in mind.

Disaggregated Data Stack

Historically, databases were tightly integrated, with all core components built together. Hadoop changed that by colocating compute and storage across a distributed system rather than in one or a few boxes. Then the cloud changed things again. Today, the stack is fully disaggregated, with each core element of the database management system in its own layer. Pick each component wisely.
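
One way to picture a disaggregated stack is as a set of independent layers, each filled by a component that can be replaced without touching the others. The following is a minimal sketch of that idea, not a description of any particular product: the layer names (storage, table_format, catalog, query_engine), the placeholder component names, and the swap_layer helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DataStack:
    """Hypothetical model of a disaggregated stack: one field per layer."""
    storage: str        # where the bytes live (assumed layer name)
    table_format: str   # how files are organized into tables (assumed)
    catalog: str        # where table metadata is tracked (assumed)
    query_engine: str   # what executes queries (assumed)


def swap_layer(stack: DataStack, layer: str, component: str) -> DataStack:
    """Return a new stack with exactly one layer replaced."""
    valid_layers = {f.name for f in fields(stack)}
    if layer not in valid_layers:
        raise ValueError(f"unknown layer: {layer!r}")
    # replace() copies the frozen dataclass, changing only the named field.
    return replace(stack, **{layer: component})


# Placeholder component names, not real products.
stack = DataStack(
    storage="object-store",
    table_format="columnar-files",
    catalog="metastore",
    query_engine="sql-engine-a",
)

# Changing the query engine leaves every other layer untouched.
upgraded = swap_layer(stack, "query_engine", "sql-engine-b")
```

In a tightly integrated database, the equivalent of swap_layer does not exist: changing one component means changing the whole system. In the disaggregated model each field is a separate decision, which is why each choice deserves care.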

Orchestrate, Orchestrate, Orchestrate

The cloud has created a need for and enabled mass orchestration—whether that’s Kubernetes for containers, Alluxio for data, Istio for APIs, Kafka for events, or Terraform ...
