Chapter 2. Getting Started with Delta Lake

In the previous chapter we introduced Delta Lake and saw how it adds transactional guarantees, DML support, auditing, a unified streaming and batch model, schema enforcement, and a scalable metadata model to traditional data lakes.

In this chapter, we will go hands-on with Delta Lake. We will first set up Delta Lake on a local machine with Spark installed, and then run Delta Lake examples in two interactive shells:

  1. First, we will run the PySpark interactive shell with the Delta Lake packages. This will allow us to type in and run a simple two-line Python program that creates a Delta table (a sketch of what this looks like appears after this list).

  2. Next, we will run a similar program with the Spark Scala shell. Although we do not cover the Scala language extensively in this book, we want to demonstrate that the Scala shell and the Scala language are also options for working with Delta Lake.
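To give a concrete flavor of that first exercise, the sketch below shows how launching the shell and creating a table might look. The package coordinates, Delta Lake version, and the /tmp/delta/helloNumbers path are illustrative assumptions rather than the exact listings used later in the chapter:

    # Launch the PySpark shell with the Delta Lake package (coordinates and
    # version are illustrative; match them to your Spark installation):
    #   pyspark --packages io.delta:delta-spark_2.12:3.1.0 \
    #     --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
    #     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

    # Inside the shell, two lines of Python are enough to create a Delta table:
    data = spark.range(0, 10)                                    # a DataFrame with the numbers 0 through 9
    data.write.format("delta").save("/tmp/delta/helloNumbers")   # writes Parquet data files plus a _delta_log directory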

Next, we will create a helloDeltaLake starter program in Python inside your favorite editor and run the program interactively in the PySpark shell. The environment we set up in this chapter, and the helloDeltaLake program, will be the basis for most other programs we create in this book.
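As a preview, a minimal helloDeltaLake sketch might look like the following. It assumes the delta-spark Python package has been installed with pip; the file name, application name, and output path are illustrative:

    # helloDeltaLake.py -- minimal starter sketch (names and paths are illustrative)
    import pyspark
    from delta import configure_spark_with_delta_pip

    # Build a SparkSession with the Delta Lake extensions enabled
    builder = (
        pyspark.sql.SparkSession.builder.appName("helloDeltaLake")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write a small DataFrame in Delta format, then read it back to verify
    data = spark.range(0, 5)
    data.write.format("delta").mode("overwrite").save("/tmp/delta/helloDeltaLake")
    spark.read.format("delta").load("/tmp/delta/helloDeltaLake").show()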

Once the environment is up and running, we are ready to look more closely at the Delta table format. Because Delta Lake uses Parquet as its underlying storage format, we first take a brief look at the Parquet format. Partitions and partition files play an important role when we study the transaction log later, so we will also study the mechanism of both automatic ...
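As a small preview of that discussion, the sketch below writes a Delta table partitioned on a column and shows the kind of directory layout that typically results; the column names and path are illustrative assumptions:

    # Write a Delta table partitioned by the "country" column (names and path are illustrative)
    df = spark.createDataFrame([(1, "US"), (2, "US"), (3, "NL")], ["id", "country"])
    df.write.format("delta").partitionBy("country").save("/tmp/delta/partitioned")

    # On disk this typically produces one subdirectory per partition value,
    # each containing Parquet data files, alongside the _delta_log transaction log:
    #   /tmp/delta/partitioned/_delta_log/00000000000000000000.json
    #   /tmp/delta/partitioned/country=US/part-...snappy.parquet
    #   /tmp/delta/partitioned/country=NL/part-...snappy.parquet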
