Chapter 5. Iceberg Catalogs

In this chapter, we’ll dive into Iceberg catalogs. You’ve seen how the catalog is a critical component of Iceberg: it ensures consistency when multiple readers and writers work on the same tables, and it lets engines discover which tables are available in the environment. We’ll cover:

  • The requirements of a catalog in general, and the additional requirements recommended for using a catalog in production

  • The different catalog implementations, including pros, cons, and how to configure Spark to use the catalog

  • In what situations you may want to consider migrating catalogs

  • How to migrate from one catalog to another
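As a preview of the Spark configuration covered later in the chapter, a catalog is typically registered with Spark through `spark.sql.catalog.*` properties. The sketch below wires up a filesystem (Hadoop) catalog; the catalog name `my_catalog` and the warehouse path are placeholders, and the exact properties vary by catalog implementation:

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "my_catalog" backed by a
# filesystem (Hadoop) warehouse. The warehouse path is a placeholder.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/warehouse")
    .getOrCreate()
)
```

Once registered, tables in this catalog are addressable as `my_catalog.db.table` from Spark SQL.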

Requirements of an Iceberg Catalog

Iceberg provides a catalog interface that requires the implementation of a set of functions — primarily to list existing tables, create tables, drop tables, check whether a table exists, and rename tables.
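Iceberg’s actual interface is defined in Java (`org.apache.iceberg.catalog.Catalog`); the following Python sketch is only an illustration of the operations just listed, using hypothetical method names and an in-memory registry in place of a real backing store:

```python
class InMemoryCatalog:
    """Illustrative, in-memory stand-in for an Iceberg-style catalog.

    Maps a table identifier (e.g., "db1.table1") to the path of the
    metadata file that holds the table's current state.
    """

    def __init__(self):
        self._tables = {}  # identifier -> current metadata file path

    def list_tables(self, namespace):
        # Tables whose identifier starts with "<namespace>."
        return [t for t in self._tables if t.startswith(namespace + ".")]

    def create_table(self, identifier, metadata_path):
        if identifier in self._tables:
            raise ValueError(f"table already exists: {identifier}")
        self._tables[identifier] = metadata_path

    def drop_table(self, identifier):
        self._tables.pop(identifier, None)

    def table_exists(self, identifier):
        return identifier in self._tables

    def rename_table(self, src, dst):
        if dst in self._tables:
            raise ValueError(f"table already exists: {dst}")
        self._tables[dst] = self._tables.pop(src)
```

A real implementation must also make the pointer update atomic so that concurrent writers cannot both commit; the in-memory dictionary here glosses over that.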

Because it is an interface, it has multiple implementations, including the Hive Metastore, AWS Glue, and a filesystem-based catalog (Hadoop). Beyond implementing the functions the interface defines, the primary high-level requirement for a catalog implementation to work as an Iceberg catalog is to map a table path (e.g., db1.table1) to the file path of the metadata file that holds the table’s current state.
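To make the mapping concrete, here is a sketch of how a filesystem (Hadoop) catalog resolves a table’s current state: a `version-hint.text` file in the table’s metadata directory names the current version N, and the current state lives in `vN.metadata.json`. This is a simplified illustration, not Iceberg’s actual Java implementation:

```python
import os


def current_metadata_path(warehouse, db, table):
    """Resolve the current metadata file for a table the way a
    filesystem (Hadoop) catalog does: read the version number from
    version-hint.text, then build the vN.metadata.json path.
    """
    metadata_dir = os.path.join(warehouse, db, table, "metadata")
    with open(os.path.join(metadata_dir, "version-hint.text")) as f:
        version = int(f.read().strip())
    return os.path.join(metadata_dir, f"v{version}.metadata.json")
```

Other catalogs store the same pointer differently — for example, as a table property in a metastore record — but all of them answer the same question: which metadata file is current?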

Since this is a generic requirement, and catalog implementations differ inherently in how they store data, different catalogs do this mapping ...
