Scale your Python processing with Dask
Published by O'Reilly Media, Inc.
Crunch big data easily in Python from a few cores to a few thousand machines
Python is arguably the preeminent language for data science, and the SciPy ecosystem enables hundreds of use cases, from astronomy to financial time series analysis to natural language processing. But most Python tools assume your data fits in memory, and many don't support parallel execution, even though today we have far more data and far more compute power. It's time to scale open source Python tools to huge datasets and huge compute clusters.
Expert Adam Breindel takes a deep dive into the open source Dask project, which supports scaling the Python data ecosystem in a straightforward and understandable way and works well on anything from a single laptop to a thousand-machine cluster. You can use Dask to scale pandas DataFrames, scikit-learn ML, NumPy tensor operations, and more, as well as implement lower-level, custom task scheduling for more unusual algorithms. Dask plays nicely with all the toys you want—Kubernetes for scaling, GPUs for acceleration, Parquet for data ingestion, and Datashader for visualization.
What you’ll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- What Dask is and why it exists
- How Dask fits into the Python and big data landscape
- How Dask can help you process more data faster
And you’ll be able to:
- Begin building systems with Dask
- Add Dask and start incrementally migrating existing components
- Analyze data and train ML models with Dask
This live event is for you because...
- You’re a data engineer, data scientist, or natural or social scientist.
- You work with Python and data.
- You want to become a practitioner or leader who focuses on pragmatic, effective solutions.
Prerequisites
- A basic understanding of Python and the Python data science stack (pandas, NumPy, and scikit-learn)
Recommended preparation:
- Review Python in “Python Language Basics, IPython, and Jupyter Notebooks” and “Built-in Data Structures, Functions, and Files” (chapters 2 and 3 in Python for Data Analysis, second edition—useful but not required)
- Review NumPy and pandas in “NumPy Basics: Arrays and Vectorized Computation” and “Getting Started with pandas” (chapters 4 and 5 in Python for Data Analysis, second edition—useful but not required)
- Review ML in “The Machine Learning Landscape” and “End-to-End Machine Learning Project” (chapters 1 and 2 in Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, second edition—useful but not required)
- Review the most common ML techniques in chapters 3–7 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, second edition (useful but not required)
Recommended follow-up:
- Read Designing Data-Intensive Applications (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction (55 minutes)
- Lecture: What Dask is, where it’s from, and what problems it solves; pandas-style analytics with pandas and Dask DataFrames
- Group discussion: Setting up and deploying Dask
- Hands-on exercise: Complete an analytics exercise
- Q&A
Break (5 minutes)
Dask graphical user interfaces (30 minutes)
- Lecture: Monitoring workers, tasks, and memory; using Dask’s built-in profiling to understand performance
- Group discussion: The biggest performance and troubleshooting challenges with big data
- Hands-on exercise: Analyze the performance of data transformation
- Q&A
Machine learning (25 minutes)
- Lecture: Modeling task; scikit-learn-style featurization with Dask
- Group discussion: Current algorithm support and integration
- Hands-on exercise: Try an alternate model
- Q&A
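A minimal sketch of the kind of handoff between Dask collections and scikit-learn this segment covers (the synthetic data and the in-memory fit are illustrative assumptions; the dask-ml package provides estimators that avoid materializing the data):

```python
import numpy as np
import dask.array as da
from sklearn.linear_model import LinearRegression

# Synthetic features as a chunked Dask array (a stand-in for a large dataset)
X = da.random.random((200, 3), chunks=(50, 3))
true_coef = np.array([1.0, 2.0, 3.0])
y = X.dot(true_coef)  # exactly linear targets, still lazy

# For a small model, materialize and hand off to scikit-learn;
# dask-ml offers scalable estimators when data can't fit in memory
model = LinearRegression().fit(X.compute(), y.compute())
print(model.coef_)  # close to [1.0, 2.0, 3.0]
```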
Break (5 minutes)
Additional data structure overview (25 minutes)
- Lecture: Dask Array; Dask Bag
- Group discussion: What can we do with a Dask Array?
- Hands-on exercise: Look at lower-level task graph opportunities in the docs
- Q&A
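A minimal sketch of the two collections named above, plus the lower-level `dask.delayed` interface behind the task-graph discussion (all values are illustrative; Dask must be installed):

```python
import dask
import dask.array as da
import dask.bag as db

# Dask Array: NumPy-style computation over chunked arrays
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum().compute()  # 1000000.0

# Dask Bag: parallel processing of generic Python objects
squares = db.from_sequence(range(10), npartitions=2).map(lambda n: n * n)
bag_sum = squares.sum().compute()  # 0 + 1 + 4 + ... + 81 = 285

# dask.delayed: build a custom task graph from plain functions
@dask.delayed
def inc(n):
    return n + 1

result = (inc(1) + inc(2)).compute()  # 5
```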
Best practices (20 minutes)
- Lecture: Managing partitions and tasks; caching
- Group discussion: File formats and data structures
Wrap-up and Q&A (15 minutes)
Your Instructor
Adam Breindel
Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam's first full-time job in tech was neural net–based fraud detection, deployed at some of North America's largest banks; since then, he's worked with numerous startups, where he's enjoyed building things like mobile check-in for two of America's five biggest airlines, years before the iPhone came out. He's also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.