Chapter 5. Open Data Lakehouse Analytics

So far, you have learned how to connect Presto to external data sources using standard connectors such as MySQL and Pinot. You have also learned how to write a custom connector using Presto’s Java classes and methods. Finally, you have connected a client to Presto to run generic or custom queries, as illustrated in the sketch below. Now it’s time to use Presto in a more advanced, realistic scenario that addresses the main challenges of big data management: table lookup, concurrent access to data, and access control.
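As a quick refresher, the following is a minimal sketch of a client querying Presto through its JDBC driver. The coordinator address, the mysql catalog, the demo schema, and the customers table are placeholders for illustration, not values from the book; any catalog and table you configured in the earlier chapters would work the same way.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator URL: jdbc:presto://<host>:<port>/<catalog>/<schema>
        String url = "jdbc:presto://localhost:8080/mysql/demo";

        // No password: assumes an unsecured local cluster.
        try (Connection conn = DriverManager.getConnection(url, "demo_user", null);
             Statement stmt = conn.createStatement();
             // "customers" is a placeholder table in the mysql catalog.
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers LIMIT 10")) {
            while (rs.next()) {
                System.out.printf("%s: %s%n", rs.getString("id"), rs.getString("name"));
            }
        }
    }
}

With the presto-jdbc driver on the classpath, it registers itself automatically, so no explicit Class.forName call is needed.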

In this chapter, we give an overview of the data lakehouse and implement a practical scenario. The chapter is divided into two parts. In the first part, we introduce the architecture of a data lakehouse, focusing on its main components. In the second part, you will build a practical data lakehouse scenario using Presto and completely open components.

The Emergence of the Lakehouse

The first generation of data lakes, based primarily on the Hadoop Distributed File System (HDFS), demonstrated the promise of analytics at scale. As a result, many organizations built data platform architectures consisting of data lakes and data warehouses, with pipelines and workflows stitched between them. However, the resulting platforms were very complex, with issues around reliability, data freshness, and cost.[1]

To overcome these issues, organizations tried to stretch both the data lake and the data warehouse in terms of the workloads they could support, but with ...
