Getting Started with Workflow Orchestration
Published by O'Reilly Media, Inc.
Build, run, and monitor data pipelines at scale
Data engineers and scientists spend most of their time on negative, or defensive, engineering: writing code to handle unpredictable failures such as resources going down, APIs failing intermittently, or malformed data corrupting pipelines. Workflow orchestration tools help eliminate this negative engineering, allowing engineers and scientists to focus on the problems they are actually solving. Modern data applications have evolved, and orchestrators such as Prefect now provide more runtime flexibility and the ability to leverage distributed compute through Dask.
Join experts Kalise Richmond and Nate Nowack to discover how workflow orchestration can free you up to build solutions, not just avert failures. You’ll learn about basic orchestration features such as retries, scheduling, parameterization, caching, and secret management, and you’ll construct real data pipelines.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- The concept of negative engineering
- What workflow orchestration is and the common features of these frameworks
- Common workflows such as extract, transform, load (ETL) pipelines
- How the modern data stack has evolved to require more from orchestrators
- How orchestration frameworks provide more runtime flexibility by eliminating the directed acyclic graph (DAG) requirement
- The role of distributed computing frameworks such as Dask in workflow orchestration
- Advanced patterns, such as subflows
- How you can combine cloud orchestration with on-premises execution
And you’ll be able to:
- Deploy an open source orchestrator service in your own infrastructure
- Frame data pipelines clearly before constructing them
- Build a real data workflow that transforms and moves data from sources to sinks
- Schedule the workflow on a regular basis
- Add monitoring and respond to failure events
- Leverage Dask to parallelize the workflow
- Use caching and persistence of data to make future runs efficient
This live event is for you because...
- You want to become a data engineer.
- You’re a data scientist or data engineer looking to deploy and monitor data jobs.
- You’re responsible for ensuring that critical data pipelines have minimal downtime.
- You have a hobby project that runs a pipeline you need to put on a schedule.
- You have workflows that don’t fit well into the DAG structure.
- You want to learn advanced patterns in workflow orchestration.
Prerequisites
- A computer with Prefect installed (it can be installed with pip, Python's package manager: pip install prefect)
- Intermediate-level Python (familiarity with decorators, context managers, and classes)
- Familiarity with SQL
- Experience using APIs
Recommended preparation:
- Watch “Files and Exceptions” and “Object-Oriented Programming” (lessons 9 and 10 in Python Fundamentals)
- Read “Data Loading, Storage, and File Formats” and “Data Cleaning and Preparation” (chapters 6 and 7 in Python for Data Analysis)
Recommended follow-up:
- Read Prefect Orion (supporting documentation)
- Explore dbt (supporting documentation)
- Read Dask (supporting documentation)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Negative engineering and workflow orchestration (30 minutes)
- Presentation: Overview of materials; negative engineering and how production data pipelines can fail; consequences of pipeline failures; common workflow patterns; the need for workflow orchestration
- Hands-on exercise: Native Python Work Example (see the defensive-retry sketch below)
- Group discussion: What can you use workflow orchestration for?
- Q&A
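For a taste of what this exercise covers, here is a minimal sketch of the hand-rolled defensive retry code that orchestration is meant to replace; the API URL is a hypothetical placeholder:

    # Hand-rolled retries and backoff: the defensive code engineers
    # end up writing by hand before adopting an orchestrator.
    import time

    import requests

    def fetch_records(url: str, max_retries: int = 3) -> list:
        for attempt in range(1, max_retries + 1):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                return response.json()
            except requests.RequestException:
                if attempt == max_retries:
                    raise  # out of retries; surface the failure
                time.sleep(2 ** attempt)  # exponential backoff before retrying

    records = fetch_records("https://api.example.com/records")  # hypothetical URL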
Prefect and basic orchestration features (35 minutes)
- Presentation: Retries, parameters, and timeouts; scheduling; task library; secrets; async execution
- Hands-on exercise: Putting Together a Simple Data Pipeline (see the Prefect sketch below)
- Q&A
- Break
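To preview this session, here is a minimal sketch of the same idea expressed with Prefect 2 (Orion) decorators; the task bodies and default URL are illustrative placeholders:

    from prefect import flow, task

    @task(retries=3, retry_delay_seconds=10)
    def extract(url: str) -> list:
        # In a real pipeline this would call an API;
        # Prefect retries the task automatically on failure.
        return [1, 2, 3]

    @task
    def load(rows: list) -> None:
        print(f"loaded {len(rows)} rows")

    @flow(timeout_seconds=300)  # fail the run if it exceeds five minutes
    def etl(url: str = "https://api.example.com/records"):
        load(extract(url))

    if __name__ == "__main__":
        etl()

The retries, the timeout, and the url parameter replace the hand-written defensive code from the first session.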
Docker and Python packaging (25 minutes)
- Presentation: The need for Docker; creating a Python package; uploading an image to a registry
- Hands-on exercise: Building a Simple Image (see the Dockerfile sketch below)
- Q&A
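A minimal, hypothetical Dockerfile of the kind built in this exercise; the file names are placeholders:

    FROM python:3.10-slim
    WORKDIR /app

    # Install dependencies first so Docker can cache this layer
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the flow code itself and run it by default
    COPY flow.py .
    CMD ["python", "flow.py"]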
Using distributed compute for parallel execution (20 minutes)
- Presentation: What Dask is; what makes Dask well suited to distributed compute; depth-first execution and mapping; using Dask to parallelize tasks
- Hands-on exercise: Running a Flow on Dask (see the sketch below)
- Q&A
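As a preview, here is a sketch of running a flow's tasks in parallel on Dask; it assumes the prefect-dask collection is installed (pip install prefect-dask), and the task itself is illustrative:

    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def double(x: int) -> int:
        return x * 2

    @flow(task_runner=DaskTaskRunner())  # starts a temporary local Dask cluster
    def parallel_flow(numbers: list[int]):
        # map submits one task run per element; Dask executes them in parallel
        return double.map(numbers)

    if __name__ == "__main__":
        parallel_flow([1, 2, 3, 4])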
Advanced patterns and subflows (30 minutes)
- Presentation: Typing and pydantic; the flow-of-flows orchestration pattern; breaking the DAG; the need for runtime flexibility; analytics on top of workflow history
- Hands-on exercise: Running a Flow Without Preregistered Components (see the flow-of-flows sketch below)
- Group discussion: What use cases need runtime flexibility?
- Q&A
- Break
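A sketch of the flow-of-flows pattern covered here: in Prefect 2, calling one @flow-decorated function from another creates a subflow run that the orchestrator tracks separately (the function names are illustrative):

    from prefect import flow

    @flow
    def extract() -> list[int]:
        return [1, 2, 3]

    @flow
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    @flow
    def parent():
        rows = extract()  # runs as a subflow
        load(rows)        # another subflow

    if __name__ == "__main__":
        parent()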
Putting a real pipeline together (20 minutes)
- Presentation: Introduction to ELT, Airbyte, dbt, and Snowflake
- Hands-on exercise: Constructing and Running an End-to-End Pipeline (sketched below)
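One hedged sketch of what such an end-to-end ELT flow can look like: trigger an Airbyte sync, then run dbt transformations in the warehouse. The Airbyte URL, connection ID, and dbt project directory are hypothetical placeholders:

    import subprocess

    import requests
    from prefect import flow, task

    @task(retries=2, retry_delay_seconds=30)
    def trigger_airbyte_sync(connection_id: str) -> None:
        # Airbyte's local API exposes a sync endpoint (URL is a placeholder)
        requests.post(
            "http://localhost:8000/api/v1/connections/sync",
            json={"connectionId": connection_id},
            timeout=30,
        ).raise_for_status()

    @task
    def run_dbt() -> None:
        # Shell out to the dbt CLI; assumes a configured dbt project
        subprocess.run(["dbt", "run"], check=True, cwd="my_dbt_project")

    @flow
    def elt_pipeline(connection_id: str):
        trigger_airbyte_sync(connection_id)
        run_dbt()  # transform the loaded data in the warehouse (e.g., Snowflake)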
Wrap-up and Q&A (10 minutes)
Your Instructors
Kalise Richmond
Kalise Richmond is a sales engineer at Prefect, the company behind the open source workflow orchestration framework of the same name. Previously, she was a software engineer at Apptio, where she helped build analytics tools. Throughout her career, Kalise has run product training sessions for both technical and nontechnical audiences.
Nathan Nowack
Nate Nowack is a solutions engineer at Prefect, where he focuses on building out cloud infrastructure and distributed workflow orchestration patterns to support client projects. He’s also a contributor to Airbyte.