Chapter 5. Performance Tuning
Any time you are storing and retrieving data, whether with a traditional RDBMS or with Delta tables, how you organize the data in the underlying storage format can significantly affect the time it takes to perform table operations and queries. In general, performance tuning refers to the process of optimizing the performance of a system, and in the context of Delta tables this involves optimizing how the data is stored and retrieved. Historically, faster data retrieval has been achieved either by adding RAM or CPU for faster processing, or by reducing the amount of data that needs to be read by skipping irrelevant data. Delta Lake provides a number of techniques that can be combined to accelerate data retrieval by reducing the number of files, and the amount of data, that must be read during operations.
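To make the idea of skipping irrelevant data concrete, here is a minimal sketch in plain Python. It assumes each data file carries the minimum and maximum value of one column, and it is only an illustration of the principle, not Delta Lake's implementation. The `FileStats` class, the `files_to_read` function, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single numeric column."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Three illustrative files with non-overlapping value ranges.
files = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 201, 300),
]

# A query for values between 150 and 180 only needs the second file.
selected = files_to_read(files, 150, 180)
print([f.path for f in selected])
```

A query whose range falls entirely outside a file's minimum and maximum can ignore that file without opening it, which is why the way values are distributed across files matters for read performance.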
An additional problem that can contribute to slower reads and inefficient processing in Apache Spark and Delta Lake is the small file problem, briefly mentioned in Chapter 1. The small file problem arises when the underlying data is divided into numerous small files rather than fewer, larger, more efficient files. It can occur for several reasons, most often frequent writes, and can be addressed through a variety of techniques in Delta Lake, including compacting small files into larger ones.
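The compaction idea can be sketched as a simple grouping problem: collect small files into groups whose combined size stays at or below a target, then rewrite each group as one larger file. The plain-Python sketch below only plans the groups. It is an illustration under assumed names (`plan_compaction`, the file names, and the sizes in MB are all hypothetical) and does not reflect how Delta Lake chooses files internally.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins whose combined size does not exceed target_size.

    file_sizes maps a file name to its size (any consistent unit).
    Files already at or above target_size are left alone, and bins holding
    a single file are dropped because rewriting them gains nothing.
    """
    bins = []
    current, current_size = [], 0
    for name in sorted(file_sizes):
        size = file_sizes[name]
        if size >= target_size:
            continue  # already large enough, no need to rewrite
        if current and current_size + size > target_size:
            bins.append(current)  # close the full bin and start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return [b for b in bins if len(b) > 1]


# Illustrative sizes in MB with a 100 MB target.
sizes = {"a": 10, "b": 20, "c": 200, "d": 30, "e": 50, "f": 60}
print(plan_compaction(sizes, 100))
```

In Delta Lake itself, compaction is exposed through the `OPTIMIZE` command rather than hand-written planning like this; the sketch is only meant to show why many small files can be merged into fewer large ones.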
By leveraging good performance tuning strategies to reduce the effects of the small file problem and better enable ...