
Data Mesh in Practice—with Interactivity

Published by O'Reilly Media, Inc.

Content level: Intermediate

How to set the foundations for federated data ownership

This live event utilizes Jupyter Notebook technology

The data lake paradigm is often considered the scalable successor of the more curated data warehouse approach when it comes to the democratization of data. However, many who set out to build a centralized data lake came back with a data swamp of unclear responsibilities, a lack of data ownership, and subpar data availability.

Accessibility and availability can only be guaranteed at scale by moving more responsibility to those who pick up the data and have the respective domain knowledge—the data owners—while keeping only data governance and metadata information central. Such a decentralized, domain-focused approach has recently been coined a data mesh.

Join experts Max Schultze and Arif Wider for a concise, comprehensive overview of the data mesh. You’ll learn how to tackle the challenges of decentralized data ownership and how to provide the right platform tooling that enables data owners to take over responsibility in a scalable and sustainable fashion. You’ll also discover how to provide data in such a way that others can create value from it, and explore the concept of a data product, which goes beyond sharing of files toward guarantees of quality and acknowledgement of data ownership.
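
As a rough illustration of the data product idea, here is a minimal Python sketch: a data product bundles the data itself with ownership, documentation, and quality guarantees. The DataProduct class, its fields, and the orders_daily example are hypothetical and not taken from the course material.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product: data plus
    ownership and quality guarantees, not just a shared file."""
    name: str
    owner: str                    # the domain team accountable for the data
    description: str              # what the dataset contains and how to use it
    schema: dict[str, str]        # column name -> documented meaning
    freshness_sla_hours: int      # how stale the data may become
    quality_checks: list[str] = field(default_factory=list)


orders = DataProduct(
    name="orders_daily",
    owner="checkout-team",
    description="One row per completed order, partitioned by day.",
    schema={"order_id": "unique order identifier",
            "amount_eur": "gross order value in EUR"},
    freshness_sla_hours=24,
    quality_checks=["no_null_order_ids", "amounts_non_negative"],
)
```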

What you’ll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The consequences of unclear data ownership
  • What a scalable structure of domain-driven, federated responsibilities looks like
  • How a shared data infrastructure platform can contribute

And you’ll be able to:

  • Facilitate steps toward federated data ownership in your company
  • Provide data in such a way that others can create value from it
  • Support data ownership by providing domain-agnostic infrastructure tooling

This live event is for you because...

  • You’re a software or data engineer.
  • You work with data production, infrastructure, or consumption.
  • You want to become a data product owner.

Prerequisites

  • Familiarity with distributed data processing
  • A basic understanding of Python


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction to data mesh (25 minutes)

  • Presentation: What’s the data mesh paradigm, and why was it invented?
  • Exercise: Jupyter Notebook setup

The data consumer perspective (45 minutes)

  • Exercise: Calculate a set of business KPIs from a prepared, largely undocumented dataset (a minimal sketch of this kind of analysis follows this section)
  • Presentation: Overview of data mesh—product thinking for data, domain-driven design applied to distributed data, and platform thinking for data infrastructure; issues on the consumer side
  • Q&A
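
To make the consumer-side exercise concrete, here is a minimal, hypothetical KPI calculation in pandas. The column names ("ts", "amt") are invented stand-ins for the course's prepared dataset; the point is that without documentation, the consumer has to guess what they mean.

```python
import pandas as pd

# Invented stand-in for the prepared dataset: without documentation,
# a consumer must guess that "ts" is an order timestamp and "amt"
# a gross order value.
raw = pd.DataFrame({
    "ts": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "amt": [19.99, 5.50, 42.00],
})

raw["ts"] = pd.to_datetime(raw["ts"])

# Example business KPI: daily revenue.
daily_revenue = raw.groupby(raw["ts"].dt.date)["amt"].sum()
print(daily_revenue)
```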

Break (5 minutes)

The data producer perspective (45 minutes)

  • Presentation: What to do on the data producer side; how to create a data product; how to think about domain boundaries
  • Exercise: Rewrite the introduced dataset with proper column descriptions; create a schema and dataset description (a sketch of the idea follows this section)
  • Presentation: Why building a good data product is hard
  • Q&A
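
One way to express "a schema plus column descriptions" is to attach metadata to the fields of an Apache Arrow schema, so the documentation travels with the data. This is only a sketch of the idea under that assumption, not the notebook's actual solution; the field names mirror the hypothetical example above.

```python
import pyarrow as pa

# Hypothetical rewrite of the dataset with documented columns: pyarrow
# lets the producer attach descriptions as field-level metadata.
schema = pa.schema(
    [
        pa.field("order_id", pa.string(),
                 metadata={"description": "unique order identifier"}),
        pa.field("amount_eur", pa.float64(),
                 metadata={"description": "gross order value in EUR"}),
    ],
    metadata={"description": "One row per completed order."},
)

print(schema)
```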

Break (5 minutes)

The data infrastructure platform perspective (45 minutes)

  • Exercise: Answer an access request by calling prepared functions; then handle many such requests repeatedly (a toy sketch follows this section)
  • Presentation: What makes a good data infrastructure platform?—domain-agnostic, self-service, etc.; the trap of taking centralized responsibility for data; platform thinking—multitenancy, how to enable interoperability, and how to stay out of domain responsibility
  • Demo: Build a platform capability/self-service tool
  • Q&A
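
A toy illustration of the shift from answering access requests by hand to offering a self-service capability: the data owner grants access through a call instead of a ticket queue. All names here (grant_access, has_access, ACCESS_GRANTS) are hypothetical and are not the course's prepared functions.

```python
# Hypothetical in-memory policy store mapping dataset -> allowed principals.
ACCESS_GRANTS: dict[str, set[str]] = {}


def grant_access(dataset: str, principal: str) -> None:
    """Record that `principal` may read `dataset` (assumed policy store)."""
    ACCESS_GRANTS.setdefault(dataset, set()).add(principal)


def has_access(dataset: str, principal: str) -> bool:
    """Check the recorded grants instead of asking a central team."""
    return principal in ACCESS_GRANTS.get(dataset, set())


grant_access("orders_daily", "analytics-team")
assert has_access("orders_daily", "analytics-team")
```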

Conclusion and wrap-up (10 minutes)

  • Presentation: The goal state; key learnings; what we did not talk about; follow-up suggestions
  • Q&A

Your Instructors

  • Max Schultze

Max Schultze is an associate director of data engineering on the data platform at HelloFresh, the world's leading meal kit company. His focus is on offering company-wide platform solutions for data infrastructure and governance. Previously, he worked as an engineering manager at Zalando, building data pipelines at petabyte scale, productionizing distributed processing engines like Spark and Trino, and providing services and tooling for data management. As an early adopter of the data mesh paradigm, he frequently advocates its use through conference appearances, online training, and publications. Max graduated from Humboldt University of Berlin, where he took part in the university’s early development of Apache Flink.

  • Arif Wider

Arif Wider is a professor of software engineering at HTW Berlin, Germany, and a lead technology consultant with Thoughtworks, where he worked with Zhamak Dehghani, who coined the term data mesh in 2019. Outside of teaching, Arif enjoys building scalable software that makes an impact, as well as building the teams that create such software. More specifically, he is fascinated by applications of artificial intelligence and by how building such applications effectively requires data scientists and developers (like himself) to work closely together.
