DuckDB: Up and Running

Book description

DuckDB is an open source in-process database created for OLAP workloads. It provides key advantages that separate this database from more mainstream OLAP solutions, including embeddability, compatibility with SQL, optimization for fast and efficient analytics, and integration with Python. This practical book shows you how DuckDB leverages Python libraries and tools for data analytics, machine learning, and AI.

Author Wei-Meng Lee shows developers, data engineers, data analysts, and data scientists how to get started. You'll learn the primary features and functions of DuckDB, explore use cases and best practices, and examine practical examples of how DuckDB can be used for a variety of data analytics tasks. You'll also dive into specific topics including how to import data into DuckDB, work with tables, perform exploratory data analysis, visualize DuckDB data, perform spatial analysis, and use DuckDB with JSON files, Polars, and JupySQL.

You'll also explore:

  • The purpose of DuckDB and its main functions
  • How to conduct data analytics tasks using DuckDB
  • Methods for integrating DuckDB with pandas, Polars, and JupySQL
  • How to use DuckDB to query your data
  • Ways to perform spatial analytics using DuckDB's spatial extension
  • How to work with a diverse range of data including Parquet, CSV, and JSON

Wei-Meng Lee is a technologist and founder of Developer Learning Solutions, a company that provides hands-on training on the latest technologies.

Publisher resources

View/Submit Errata

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Getting Started with DuckDB
    1. Introduction to DuckDB
      1. Why Use DuckDB?
      2. High Performance Analytical Queries
      3. Versatile Integration and Ease of Use Across Multiple Programming Languages
      4. Open Source
    2. Quick look at DuckDB
      1. Loading Data into DuckDB
      2. Inserting a Record
      3. Querying a Table
      4. Performing Aggregation
      5. Joining Tables
      6. Reading Data from Pandas
    3. Conclusion
  3. 2. Importing Data into DuckDB
    1. Creating DuckDB Databases
    2. Loading Data from Different Data Sources and Formats
      1. Working with CSV Files
      2. Working with Parquet Files
      3. Working with Excel Files
      4. Working with MySQL
    3. Conclusion
  4. 3. A Primer on SQL
    1. Using the DuckDB CLI
      1. Importing Data into DuckDB
      2. Dot Commands
      3. Persisting the Database on Disk
    2. SQL Primer
      1. Creating a Database
      2. Creating Tables
      3. Viewing the Schemas of Tables
      4. Dropping a Table
    3. Working with Tables
      1. Populating Tables with Rows
      2. Updating Rows
      3. Deleting Rows
      4. Querying Tables
      5. Joining Tables
      6. Aggregating data
      7. Analytics
    4. Summary
  5. 4. Using DuckDB with Polars
    1. Introduction to Polars
      1. Creating a Polars DataFrame
      2. Understanding Lazy Evaluation in Polars
    2. Querying Polars DataFrames using DuckDB
      1. Using the sql() Function
      2. Using the DuckDBPyRelation object
    3. Conclusion
  6. 5. Performing EDA with DuckDB
    1. Our Dataset – The 2015 Flights Delay Dataset
    2. Geospatial Analytics
      1. Displaying a Map
      2. Displaying all the Airports on the Map
      3. Using the Spatial Extension in DuckDB
    3. Performing Descriptive Analytics
      1. Finding the Airports for Each State and City
      2. Aggregating the Total Number of Airports in Each State
      3. Obtaining the Flight Counts for Each Pair of Origin and Destination Airports
      4. Getting the Cancelled Flights from Airlines
      5. Getting the Flight Count for Each Day of Week
      6. Finding the Most Common Timeslot for Flight Delays
      7. Finding the Airlines with the Most and Least Delays
    4. Summary
  7. 6. Using DuckDB with JSON Files
    1. Primer on JSON
    2. Object
    3. String
    4. Boolean
    5. Number
    6. Nested Object
    7. Arrays
    8. null
    9. Loading JSON Files into DuckDB
      1. Using the read_json_auto() function
      2. Using the read_json() function
      3. Using the COPY-FROM statement
    10. Exporting Tables to JSON
    11. Conclusion
  8. 7. Using DuckDB with JupySQL
    1. What is JupySQL?
      1. Installing JupySQL
      2. Loading the sql Extension
      3. Integrating with DuckDB
      4. Performing Queries
      5. Storing Snippets
    2. Visualization
      1. Histogram
      2. Pie Chart
      3. Bar Plot
    3. Integrating with MySQL
      1. Using Environment Variables
      2. Using .ini File
      3. Using Keyring
    4. Conclusion
  9. 8. Accessing Remote Data using DuckDB
    1. DuckDB’s httpfs Extension
    2. Querying CSV and Parquet Files Remotely
      1. Accessing Files on GitHub
      2. Accessing Remote Parquet Files
    3. Querying Hugging Face Dataset
      1. Using Hugging Face Dataset
      2. Reading the Dataset using hf:// Paths
      3. Accessing Files Within a Folder
      4. Querying Multiple Files using the Glob Syntax
      5. Working with Private Hugging Face Dataset
    4. Conclusion
  10. About the Author

Product information

  • Title: DuckDB: Up and Running
  • Author(s): Wei-Meng Lee
  • Release date: January 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098159696