Chapter 5. Iceberg Catalogs
In this chapter, we’ll dive into Iceberg catalogs. You’ve seen that the catalog is a critical component of Iceberg: it ensures consistency when there are multiple readers and writers, and it lets you discover which tables are available in the environment. We’ll cover:
The requirements of a catalog in general, and additional requirements recommended for the use of a catalog in production
The different catalog implementations, including pros, cons, and how to configure Spark to use the catalog
In what situations you may want to consider migrating catalogs
How to migrate from one catalog to another
Requirements of an Iceberg Catalog
Iceberg provides a catalog interface that requires the implementation of a set of functions, primarily ones to list existing tables, create tables, drop tables, check whether a table exists, and rename tables.
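To make the shape of that interface concrete, here is a minimal Python sketch of the five operations just listed. It is only an illustration: the class names `Catalog` and `InMemoryCatalog`, the snake_case method names, and the use of dotted strings such as `db1.table1` as identifiers are assumptions made for this example, not Iceberg's actual API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Catalog(ABC):
    """Hypothetical stand-in for a catalog interface (not Iceberg's real API)."""

    @abstractmethod
    def list_tables(self, namespace: str) -> list[str]: ...

    @abstractmethod
    def create_table(self, identifier: str) -> None: ...

    @abstractmethod
    def drop_table(self, identifier: str) -> None: ...

    @abstractmethod
    def table_exists(self, identifier: str) -> bool: ...

    @abstractmethod
    def rename_table(self, old: str, new: str) -> None: ...


class InMemoryCatalog(Catalog):
    """Toy implementation that keeps table identifiers in a set."""

    def __init__(self) -> None:
        self._tables: set[str] = set()

    def list_tables(self, namespace: str) -> list[str]:
        # Identifiers are assumed to look like "<namespace>.<table>".
        prefix = namespace + "."
        return sorted(t for t in self._tables if t.startswith(prefix))

    def create_table(self, identifier: str) -> None:
        if identifier in self._tables:
            raise ValueError(f"table already exists: {identifier}")
        self._tables.add(identifier)

    def drop_table(self, identifier: str) -> None:
        if identifier not in self._tables:
            raise KeyError(f"no such table: {identifier}")
        self._tables.remove(identifier)

    def table_exists(self, identifier: str) -> bool:
        return identifier in self._tables

    def rename_table(self, old: str, new: str) -> None:
        if new in self._tables:
            raise ValueError(f"table already exists: {new}")
        self.drop_table(old)
        self._tables.add(new)


catalog = InMemoryCatalog()
catalog.create_table("db1.table1")
catalog.rename_table("db1.table1", "db1.table2")
```

A real implementation would back these calls with an external system rather than a Python set, but the contract each one has to satisfy is the same.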
Because it is an interface, it has multiple implementations, including Hive Metastore, AWS Glue, and a filesystem catalog (Hadoop). Beyond implementing the functions defined in the interface, the primary high-level requirement for a catalog implementation to work as an Iceberg catalog is to map a table path (e.g., db1.table1) to the filepath of the metadata file that holds the table’s current state.
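The mapping requirement can be pictured as a lookup from a table path to a single "current metadata file" location. The following Python sketch illustrates the idea under stated assumptions: the class `MetadataPointerCatalog`, its method names, and the bucket and file paths are all hypothetical and chosen for this example; real catalogs store this pointer in their own backing systems.

```python
from __future__ import annotations


class MetadataPointerCatalog:
    """Toy catalog that maps a table path to its current metadata file."""

    def __init__(self) -> None:
        # Table path (e.g. "db1.table1") -> location of the current metadata file.
        self._current_metadata: dict[str, str] = {}

    def register(self, table_path: str, metadata_location: str) -> None:
        """Start tracking a table by recording its first metadata file."""
        if table_path in self._current_metadata:
            raise ValueError(f"table already registered: {table_path}")
        self._current_metadata[table_path] = metadata_location

    def current_metadata_location(self, table_path: str) -> str:
        """Return where the table's current state is described."""
        try:
            return self._current_metadata[table_path]
        except KeyError:
            raise KeyError(f"no such table: {table_path}") from None

    def update(self, table_path: str, new_metadata_location: str) -> None:
        """Point the table at a newer metadata file."""
        if table_path not in self._current_metadata:
            raise KeyError(f"no such table: {table_path}")
        self._current_metadata[table_path] = new_metadata_location


# Hypothetical locations, used only for illustration.
pointers = MetadataPointerCatalog()
pointers.register(
    "db1.table1",
    "s3://example-bucket/db1/table1/metadata/v1.metadata.json",
)
pointers.update(
    "db1.table1",
    "s3://example-bucket/db1/table1/metadata/v2.metadata.json",
)
location = pointers.current_metadata_location("db1.table1")
```

In this sketch a reader only ever needs the table path: the catalog resolves it to whichever metadata file is current at that moment.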
Since this is a generic requirement and there are a variety of catalog implementations with each system having inherent differences as to how they store data, different catalogs do this mapping ...