Book description
Modern systems contain multi-core CPUs and GPUs capable of parallel computing, but many scientific Python tools were not designed to take advantage of that parallelism. With this short but thorough resource, data scientists and Python programmers will learn how Dask, an open source library for parallel computing, provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.
Authors Holden Karau and Mika Kimmins show you how to run Dask computations on local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.
With this book, you'll learn:
- What Dask is, where you can use it, and how it compares with other tools
- How to use Dask for batch data parallel processing
- Key distributed system concepts for working with Dask
- Methods for using Dask with higher-level APIs and building blocks
- How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
- How to use Dask with GPUs
Table of contents
- Preface
- 1. What Is Dask?
- 2. Getting Started with Dask
- 3. How Dask Works: The Basics
- 4. Dask DataFrame
  - How Dask DataFrames Are Built
  - Loading and Writing
  - Indexing
  - Shuffles
  - Embarrassingly Parallel Operations
  - Working with Multiple DataFrames
  - What Does Not Work
  - What’s Slower
  - Handling Recursive Algorithms
  - Re-computed Data
  - How Other Functions Are Different
  - Data Science with Dask DataFrame: Putting It Together
  - Conclusion
- 5. Dask’s Collections
- 6. Advanced Task Scheduling: Futures and Friends
- 7. Adding Changeable/Mutable State with Dask Actors
- 8. How to Evaluate Dask’s Components and Libraries
- 9. Migrating Existing Analytic Engineering
- 10. Dask with GPUs and Other Special Resources
- 11. Machine Learning with Dask
- 12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring
- A. Key System Concepts for Dask Users
  - Testing
  - Data and Output Validation
  - Peer-to-Peer Versus Centralized Distributed
  - Methods of Parallelism
  - Network Fault Tolerance and CAP Theorem
  - Recursion (Tail and Otherwise)
  - Versioning and Branching: Code and Data
  - Isolation and Noisy Neighbors
  - Machine Fault Tolerance
  - Scalability (Up and Down)
  - Cache, Memory, Disk, and Networking: How the Performance Changes
  - Hashing
  - Data Locality
  - Exactly Once Versus At Least Once
  - Conclusion
- B. Scalable DataFrames: A Comparison and Some History
- C. Debugging Dask
- D. Streaming with Streamz and Dask
- Index
- About the Authors
Product information
- Title: Scaling Python with Dask
- Author(s): Holden Karau, Mika Kimmins
- Release date: July 2023
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781098119874