Python Polars: The Definitive Guide

Book description

Want to speed up your data analysis and work with larger-than-memory datasets? Python Polars offers a blazingly fast, multithreaded, and elegant API for data loading, manipulation, and processing. With this hands-on guide, you'll walk through every aspect of Polars and learn how to tackle practical use cases using real-world datasets.

Jeroen Janssens and Thijs Nieuwdorp from Xomnia in Amsterdam show you how this superfast DataFrame library is perfect for efficient data wrangling, ETL pipelines, and so much more. This book helps you quickly learn the syntax and understand Polars' underlying concepts. You don't need to have experience with pandas or Spark, but if you do, this book will help you make a smooth transition.

With this definitive guide at your side, you'll be able to:

  • Process larger-than-memory datasets at record speed
  • Apply the eager, lazy, and streaming APIs of Polars and decide when to use them
  • Transition smoothly from pandas or Spark to Polars
  • Integrate Polars into your existing code base
  • Work with Arrow and Parquet to efficiently read and write data
  • Translate complex ETL tasks into efficient and elegant queries

Table of contents

  1. Foreword
  2. Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  3. I. Begin
  4. 1. Introducing Polars
    1. What Is This Thing Called Polars?
      1. Features
      2. Key Concepts
      3. Advantages
    2. Why You Should Use Polars
      1. Performance
      2. Usability
      3. Popularity
      4. Sustainability
    3. Polars Compared to Other Data Processing Packages
    4. Why We Focus on Python Polars
    5. How This Book is Organized
    6. An ETL Showcase
      1. Extract
      2. Bonus: Visualizing Neighborhoods and Stations
      3. Transform
      4. Bonus: Visualizing Daily Trips per Borough
      5. Load
      6. Bonus: Becoming Faster by Being Lazy
    7. Takeaways
  5. 2. Getting Started
    1. Setting Up Your Environment
      1. Downloading the Project
      2. Installing uv
      3. Installing the Project
      4. Working with the Virtual Environment
      5. Verifying Your Installation
    2. Crash Course JupyterLab
      1. Keyboard Shortcuts
    3. Installing Polars on Other Projects
      1. All Optional Dependencies
      2. Optional Dependencies for Interoperability
      3. Optional Dependencies for Working with Spreadsheets
      4. Optional Dependencies for Working with Databases
      5. Optional Dependencies for Working with Remote File Systems
      6. Optional Dependencies for Other I/O Formats
      7. Optional Dependencies for Extra Functionality
      8. Installing Optional Dependencies
    4. Configuring Polars
      1. Temporary Configuration Using a Context Manager
      2. Local Configuration Using a Decorator
    5. Compiling Polars from Scratch
      1. Edge Case: Very Large Datasets
      2. Edge Case: Processors Lacking AVX Support
    6. Takeaways
  6. 3. Moving from Pandas to Polars
    1. Animals
    2. Similarities to Recognize
    3. Appearances to Appreciate
      1. Differences in Code
      2. Differences in Display
    4. Concepts to Unlearn
      1. Index
      2. Axes
      3. Indexing and Slicing
      4. Eagerness
      5. Relaxedness
    5. Syntax to Forget
      1. Common Operations Side By Side
    6. To and From Pandas
    7. Takeaways
  7. II. Form
  8. 4. Data Structures and Data Types
    1. Series, DataFrames, and LazyFrames
    2. Data Types
      1. Nested Data Types
      2. Missing Values
    3. Data Type Conversion
    4. Takeaways
  9. 5. Eager and Lazy APIs
    1. Eager API: DataFrame
    2. Lazy API: LazyFrame
    3. Performance Differences
    4. Functionality Differences
      1. Attributes
      2. Aggregation Methods
      3. Computation Methods
      4. Descriptive Methods
      5. Group By Methods
      6. Exporting Methods
      7. Manipulation and Selection Methods
      8. Miscellaneous Methods
    5. Tips and Tricks
      1. Going from LazyFrame to DataFrame and Vice Versa
      2. Joining a DataFrame With a LazyFrame
      3. Caching Intermediate Results
    6. Takeaways
  10. 6. Reading and Writing Data
    1. Format Overview
    2. Reading CSV Files
    3. Parsing Missing Values Correctly
    4. Reading Files with Encodings Other than UTF-8
    5. Reading Excel Spreadsheets
    6. Working with Multiple Files
    7. Reading Parquet
    8. Reading JSON and NDJSON
      1. JSON
      2. NDJSON
    9. Other File Formats
    10. Querying Databases
    11. Writing Data
      1. CSV Format
      2. Excel Format
      3. Parquet Format
      4. Other Considerations
    12. Takeaways
  11. III. Express
  12. 7. Beginning Expressions
    1. Methods and Namespaces
    2. Expressions by Example
      1. Selecting Columns with Expressions
      2. Creating New Columns with Expressions
      3. Filtering Rows with Expressions
      4. Aggregating with Expressions
      5. Sorting Rows with Expressions
    3. The Definition of an Expression
      1. Properties of Expressions
    4. Creating Expressions
      1. From Existing Columns
      2. From Literal Values
      3. From Ranges
      4. Other Functions to Create Expressions
    5. Renaming Expressions
    6. Expressions Are Idiomatic
    7. Takeaways
  13. 8. Continuing Expressions
    1. Types of Operations
      1. Example A: Element-Wise Operations
      2. Example B: Operations that Summarize to One
      3. Example C: Operations that Summarize to One or More
      4. Example D: Operations that Extend
    2. Element-Wise Operations
      1. Operations That Perform Mathematical Transformations
      2. Operations Related to Trigonometry
      3. Operations That Round and Categorize
      4. Operations for Missing or Infinite Values
      5. Other Operations
    3. Nonreducing Series-Wise Operations
      1. Operations That Accumulate
      2. Operations That Fill and Shift
      3. Operations Related to Duplicate Values
      4. Operations That Compute Rolling Statistics
      5. Operations That Sort
      6. Other Operations
    4. Series-Wise Operations that Summarize to One
      1. Operations That Are Quantifiers
      2. Operations That Compute Statistics
      3. Operations That Count
      4. Other Operations
    5. Series-Wise Operations that Summarize to One or More
      1. Operations Related to Unique Values
      2. Operations That Select
      3. Operations That Drop Missing Values
      4. Other Operations
    6. Series-Wise Operations that Extend
    7. Takeaways
  14. 9. Combining Expressions
    1. Inline Operators Versus Methods
    2. Arithmetic Operations
    3. Comparison Operations
    4. Boolean Algebra Operations
    5. Bitwise Operations
    6. Using Functions
      1. When, Then, Otherwise
    7. Takeaways
  15. IV. Transform
  16. 10. Selecting and Creating Columns
    1. Selecting Columns
      1. Introducing Selectors
      2. Selecting Based on Name
      3. Selecting Based on Data Type
      4. Selecting Based on Position
      5. Combining Selectors
    2. Creating Columns
    3. Related Column Operations
    4. Takeaways
  17. 11. Filtering and Sorting Rows
    1. Filtering Rows
      1. Filtering Based on Expressions
      2. Filtering Based on Column Names
      3. Filtering Based on Constraints
    2. Sorting Rows
      1. Sorting Based On a Single Column
      2. Sorting in Reverse
      3. Sorting Based on Multiple Columns
      4. Sorting Based on Expressions
      5. Sorting Nested Data Types
    3. Related Row Operations
    4. Takeaways
  18. 12. Working with Textual, Temporal, and Nested Data Types
    1. String
      1. String Methods
      2. String Examples
    2. Categorical
      1. Categorical Methods
      2. Categorical Examples
    3. Enum
    4. Temporal
      1. Temporal Methods
      2. Temporal Examples
    5. List
      1. List Methods
      2. List Examples
    6. Array
      1. Array Methods
      2. Array Examples
    7. Struct
      1. Struct Methods
      2. Struct Examples
    8. Takeaways
  19. 13. Summarizing and Aggregating
    1. Split, Apply, and Combine
    2. GroupBy Context
    3. The Descriptives
    4. The Advanced
    5. Aggregate Values to a List
    6. Rename Aggregated Columns
    7. Apply Multiple Aggregations At Once
    8. Row-Wise Aggregations
    9. Window Functions in Selection Context
    10. Dynamic Grouping
    11. Rolling Aggregations
    12. Upsampling
    13. Takeaways
  20. 14. Joining and Concatenating
    1. Joining
      1. Join Strategies
      2. Joining on Multiple Columns
      3. Validation
    2. Inexact Joining
      1. Inexact Join Strategies
      2. Additional Finetuning
      3. Use Case: Marketing Campaign Attribution
    3. Vertical and Horizontal Concatenation
      1. Vertical
      2. Horizontal
      3. Diagonal
      4. Align
      5. Relaxed
      6. Stacking
      7. Appending
      8. Extending
    4. Takeaways
  21. 15. Reshaping
    1. Wide Versus Long DataFrames
    2. Pivot to Wider DataFrame
    3. Unpivot to Longer DataFrame
    4. Transposing
    5. Exploding
    6. Partition into Multiple DataFrames
    7. Takeaways
  22. V. Advance
  23. 16. Visualizing Data
    1. NYC Bike Trips
    2. Built-in Plotting with Altair
      1. Introducing Altair
      2. Methods in the Plot Namespaces
      3. Plotting DataFrames
      4. Too Large to Handle
      5. Plotting Series
    3. Pandas-like Plotting With hvPlot
      1. Introducing hvPlot
      2. A First Plot
      3. Methods in the hvPlot Namespace
      4. Pandas as Backup
      5. Manual Transformations
      6. Changing the Plotting Backend
      7. Plotting Points on a Map
      8. Composing Plots
      9. Adding Interactive Widgets
    4. Publication-Quality Graphics with Plotnine
      1. Introducing Plotnine
      2. Plots For Exploration
      3. Plots For Communication
    5. Styling DataFrames With Great Tables
    6. Takeaways
  24. 17. Extending Polars
    1. User Defined Functions in Python
      1. Applying a Function to Elements
      2. Applying a Function to Series
      3. Applying a Function to Groups
      4. Applying a Function to an Expression
      5. Applying a Function to a DataFrame or LazyFrame
    2. Registering Your Own Namespace
    3. Polars Plug-Ins in Rust
      1. Prerequisites
      2. The Anatomy of a Plug-in Project
      3. The Plug-in
      4. Compiling the Plug-in
      5. Performance Benchmark
      6. Register Arguments
      7. Using a Rust Crate
      8. Use Case: Geo
    4. Takeaways
  25. 18. Polars Internals
    1. Polars’ Architecture
    2. Arrow
    3. Multi-Threaded Computations and SIMD Operations
    4. The String Data Type in Memory
    5. ChunkedArrays in Series
    6. Query Optimization
      1. LazyFrame Scan Level Optimizations
      2. Other Optimizations
    7. Checking Your Expressions
      1. Meta Namespace Overview
      2. Meta Namespace Examples
    8. Profiling Polars
    9. Tests in Polars
      1. Comparing DataFrames and Series
    10. Common Anti-patterns
      1. Using Brackets for Column Selection
      2. Misusing Collect
      3. Using Python Code in your Polars Queries
    11. Takeaways
  26. A. Accelerating Polars with the GPU
    1. NVIDIA RAPIDS
    2. Installing the GPU Engine
      1. Step 1: Install WSL2 on Windows
      2. Step 2: Install Ubuntu Linux on WSL2
      3. Step 3: Install Prerequisite Ubuntu Linux Packages
      4. Step 4: Install the CUDA Toolkit
      5. Step 5: Install Python Dependencies
      6. Step 6: Test Your Installation
    3. Using the Polars GPU Engine
      1. Configuration
      2. Unsupported Features
    4. Benchmarking the Polars GPU Engine
      1. Solutions
      2. Queries and Data
      3. Methods
      4. Results and Discussion
      5. Conclusion
    5. The Future of Polars on the GPU
    6. Takeaways
  27. Index
  28. About the Authors

Product information

  • Title: Python Polars: The Definitive Guide
  • Author(s): Jeroen Janssens, Thijs Nieuwdorp
  • Release date: February 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098156084