Preface

The goal of this book is to provide data practitioners with practical instructions on how to set up Delta Lake and start using its unique features. This book is designed for an audience that fits any of the following profiles:

  • Data practitioners with a Spark background

  • Data practitioners unfamiliar with or new to Delta Lake needing an introduction to the technology, the problems it solves, its main features and terminology, as well as how to get started using it

  • Data practitioners looking to learn about the features and benefits of modern lakehouse architectures

Note that this book and the features it discusses apply to the Delta Lake open source framework (Delta Lake OSS). Proprietary features and optimizations that some companies offer around Delta Lake are outside the scope of this book.

First, we discuss why Delta Lake is an important tool for building modern enterprise data platforms and data science and AI solutions, followed by instructions on how to set up Delta Lake with Spark. Each of the subsequent chapters will walk you through the fundamental functions and operations of Delta Lake using step-by-step instructions and real-world examples.

The code examples in the book range from snippets that can be used in a PySpark shell to those designed to be run with a complete end-to-end notebook. All code snippets in this book are written in Python or SQL, with shell commands where necessary.

A GitHub repository is provided to aid readers in following along throughout the book. Datasets, files, and code samples are provided in the repo and referred to throughout the book. Below are some important things to note about using the GitHub repo:

Code samples

Code samples are organized in the repo by chapter. Most chapters include a chapter initialization script that must be executed before any of the other code for that chapter; it sets up the Delta tables and datasets needed to demonstrate the topics being discussed. The text of the book explicitly calls out each chapter initialization script before the first set of sample code for a given chapter.

Code sample data files

Data files required to execute the provided code samples live in the GitHub repository. The data files in the GitHub repo come from the popular NYC Yellow and Green taxi trip records. These files were downloaded and curated for effective demonstration throughout this book.

Method for running Delta Lake for this book

The method for running Delta Lake for the purposes of this book and the code in the provided GitHub repo is Databricks Community Edition. Databricks Community Edition was chosen to develop and run the code samples because it is free, simplifies setup of Spark and Delta Lake, and does not require you to have your own cloud account or to supply cloud compute or storage resources. The Delta tables, datasets, and code samples used in this book and the GitHub repo were developed and tested on Databricks Community Edition hosted on Azure, using Azure Data Lake Storage Gen2 as the underlying storage layer and Databricks Runtime 12.2 LTS. Please note that if you are running the code samples on Spark and Delta Lake outside of Databricks (e.g., on your local machine), you will need to account for additional setup and configuration, and possibly for differences in editor syntax.
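As a rough illustration of the additional setup involved when running outside of Databricks, the following sketch shows one way a local Spark session with Delta Lake enabled might be created. It assumes that the pyspark and delta-spark packages are installed in compatible versions. The helper names delta_spark_conf and build_local_delta_session are hypothetical and are not part of the book's repo; the two configuration keys and configure_spark_with_delta_pip come from the Delta Lake open source project.

```python
# Sketch only: assumes `pip install pyspark delta-spark` with compatible versions.
# The helper names below are hypothetical, chosen for illustration.


def delta_spark_conf():
    """Return the two Spark settings that enable Delta Lake on a plain Spark session."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_local_delta_session(app_name="delta-local"):
    """Create a local SparkSession with Delta Lake enabled."""
    # Imported lazily so the settings above can be inspected without Spark installed.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in delta_spark_conf().items():
        builder = builder.config(key, value)

    # Adds the Delta Lake JARs that match the installed delta-spark package.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

On Databricks none of this is needed, because the notebook already provides a configured spark session. Locally, you would call build_local_delta_session() once and use the returned session in place of the notebook's spark variable.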

Notebooks

You will also see the term notebook. A notebook refers to a Databricks notebook, the primary tool for developing code and presenting results throughout the book.

Code languages

Delta Lake exposes its functionality through multiple languages (Scala, Java, Python, and SQL). This book focuses primarily on Python and SQL. Code samples use the language deemed most appropriate to the topic being discussed, and equivalents in other languages are not always provided. Please refer to the Delta Lake documentation to view similar functionality in alternative languages.

For code snippets used throughout this book, the default language is Python. To indicate use of a language other than Python in a code snippet, you will see language magic commands, that is, %<language> (e.g., %sql). You can assume that code snippets without a language magic command are using Python.
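To make the convention concrete, here is a minimal sketch of how a snippet's language could be read off its first non-blank line. The function snippet_language, the example path, and the table name some_table are hypothetical and exist only for illustration. The sketch also treats any leading %<name> as a language, which is a simplification.

```python
def snippet_language(snippet, default="python"):
    """Return the language implied by a leading %<language> magic command."""
    for line in snippet.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith("%"):
            # The magic command is the first token, e.g. "%sql" -> "sql".
            tokens = stripped[1:].split()
            return tokens[0].lower() if tokens else default
        # First non-blank line is ordinary code, so the default applies.
        return default
    return default


# No magic command: the snippet is read as Python (placeholder path).
python_cell = 'df = spark.read.format("delta").load("/path/to/table")'

# A %sql magic command on the first line marks the snippet as SQL
# (placeholder table name).
sql_cell = "%sql\nSELECT COUNT(*) FROM some_table"

print(snippet_language(python_cell))
print(snippet_language(sql_cell))
```

The two example cells mirror how snippets are presented in the text: the first has no magic command, and the second starts with %sql.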

How to Contact Us

Please address comments and questions concerning this book to the publisher.

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/delta-lake-up-and-running-1e.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Follow us on Twitter: https://twitter.com/oreillymedia.

Watch us on YouTube: https://youtube.com/oreillymedia.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/benniehaelen/delta-lake-up-and-running.

If you have a technical question or a problem using the code examples, please send email to .

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Delta Lake: Up and Running by Bennie Haelen and Dan Davis (O’Reilly). Copyright 2024 O’Reilly Media, Inc., 978-1-098-13972-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators shares its knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

Acknowledgment

We would like to thank our technical reviewers: Adam Breindel, Andrei Ionescu, and Jobenish Purushothaman. Their attention to detail, feedback, and thoughtful suggestions played a pivotal role in shaping the content of this book and ensuring its accuracy. Their input undoubtedly helped make this book a higher-quality product that will be a valuable resource to readers.

Aside from the technical reviewers, we also received valuable feedback throughout the process of writing the book from other contributors. We would like to extend our thanks to the following: Alex Ott, Anthony Krinsky, Artem Sheiko, Bilal Obeidat, Carlos Morillo, Eli Swanson, Guillermo G. Schiava D’Albano, Jitesh Soni, Joe Widen, Kyle Hale, Marco Scagliola, Nick Karpov, Nouran Younis, Ori Zohar, Sirui Sun, Susan Pierce, and Youssef Mrini. Without your input, this book would not be the valuable resource it is.

Finally, we would like to thank the open source community. Without the community’s contributions and collective efforts, Delta Lake would not have the remarkable capabilities it has today. The community’s commitment to innovation helps drive Delta Lake’s evolution and impact, and we, along with others, cannot express our thanks and appreciation enough.

Bennie Haelen

I would like to thank my wonderful wife Jenny. You have always been there to encourage and motivate me throughout the writing of this book; you are the great inspiration of my life. Thanks to my co-author Dan for being there through difficult periods in my life. Dan, you have a great career ahead of you. Thanks to my friends and colleagues, whom I can always reach out to with challenging questions no matter the time of day.

Dan Davis

I would like to thank my family. Your continued encouragement and support have provided the foundation of my journey to where I am today and in writing this book. Thank you for always being a constant source of motivation. I would also like to thank all of my friends and colleagues that I have learned from and who have continually provided support to me along the way. I cannot thank my co-author, Bennie, enough. Thank you for being the mentor that you are, providing me with support, and presenting me with great opportunities. And last but not least, I would like to thank my beloved companion, who is always by my side whether he enjoys it or not, my dog River.
