Chapter 10. Performance Tuning: Optimizing Your Data Pipelines with Delta Lake
Up to this point, you’ve explored various ways of working with Delta Lake and seen many of the features that make it a better and more reliable choice as a storage format for your data. Tuning your Delta Lake tables for performance, however, requires a solid understanding of the basic mechanics of table maintenance, covered in Chapter 5, as well as some knowledge of, and practice with, the internal and advanced features introduced in Chapter 8. Performance is the focus now: we’ll look in more detail at the impact of pulling the levers on some of those features. We encourage you to review the topics in Chapter 5 if you have not used or reviewed them recently.
In general, you will want to maximize the reliability and efficiency of your data creation, consumption, and maintenance tasks without adding unnecessary costs to your data processing pipelines. By taking the time to optimize your workloads properly, you can balance the overhead of these tasks against performance considerations in a way that aligns with your objectives. What you should gain here is an understanding of how tuning some of the features you’ve already seen can help you achieve those objectives.
First, there’s some background work to provide a bit of clarity about the nature of your objectives. After that, ...