Chapter 6. Apache Spark

Apache Spark stands out as a highly versatile distributed compute engine to pair with Apache Iceberg, thanks to its support for an extensive range of Iceberg features. Combining Spark with Iceberg lets you take advantage of Iceberg’s efficient data organization and management capabilities. In this chapter, we will explore the steps needed to get started with Apache Iceberg and Spark, and then dive into some critical capabilities. By the end of this chapter, you will be able to configure Apache Iceberg; perform Data Definition Language (DDL) operations (CREATE, ALTER), queries (SELECT), and Data Manipulation Language (DML) operations (INSERT, UPDATE, DELETE, MERGE); and manage Iceberg tables with different processing engines.

Configuration

We’ll start by discussing how to configure Apache Iceberg tables and catalogs using Spark as the compute engine, covering the basic configuration parameters you need in order to work with Iceberg and Spark seamlessly.

Configuring Apache Iceberg and Spark

To begin working with Apache Iceberg tables in Apache Spark, you first need to configure the two to work together. There are a couple of ways to define these configurations. First you will see how to set them via configuration flags when launching the Spark Shell or Spark SQL, and then you will see how to do the same in a Python application.

Configuring via the CLI

As a first step, you’ll need to specify the required ...
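As a rough illustration of the kind of flags involved, the sketch below launches Spark SQL with the Iceberg runtime package, the Iceberg SQL extensions, and a filesystem-based (Hadoop) catalog. The version coordinates, the catalog name `local`, and the warehouse path are placeholder assumptions; adjust them to match your Spark and Iceberg versions and your environment.

```shell
# Launch Spark SQL with Iceberg support.
# --packages pulls the Iceberg Spark runtime from Maven Central
# (the 3.5_2.12 / 1.5.0 coordinates are illustrative assumptions).
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg-warehouse
```

With a session started this way, tables created under the `local` catalog (e.g., `CREATE TABLE local.db.tbl ...`) are written as Iceberg tables under the configured warehouse directory. The same `--packages` and `--conf` flags work with `spark-shell` and `pyspark` as well.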
