Getting Started with Workflow Orchestration
Published by O'Reilly Media, Inc.
Build, run, and monitor data pipelines at scale
Data engineers and scientists spend most of their time on negative, or defensive, engineering: writing code to handle unpredictable failures such as resources going down, APIs failing intermittently, or malformed data corrupting pipelines. Workflow orchestration tools help eliminate this negative engineering, allowing engineers and scientists to focus on the problems they are actually solving. Modern data applications have evolved, and orchestrators such as Prefect now provide more runtime flexibility and the ability to leverage distributed compute through Dask.
Join experts Kalise Richmond and Nate Nowack to discover how workflow orchestration can free you up to build solutions, not just avert failures. You’ll learn about basic orchestration features such as retries, scheduling, parameterization, caching, and secret management, and you’ll construct real data pipelines.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- The concept of negative engineering
- What workflow orchestration is and the common features of these frameworks
- Common workflows such as extract, transform, load (ETL) pipelines
- How the modern data stack has evolved to require more from orchestrators
- How orchestration frameworks provide more runtime flexibility by eliminating the directed acyclic graph (DAG) requirement
- The role of distributed computing frameworks such as Dask in workflow orchestration
- Advanced patterns, such as subflows
- How you can combine cloud orchestration with on-premises execution
And you’ll be able to:
- Deploy an open source orchestrator service in your own infrastructure
- Frame data pipelines clearly before constructing them
- Build a real data workflow that transforms and moves data from sources to sinks
- Schedule the workflow on a regular basis
- Add monitoring and respond to failure events
- Leverage Dask to parallelize the workflow
- Use caching and persistence of data to make future runs efficient
This live event is for you because...
- You want to become a data engineer.
- You’re a data scientist or data engineer looking to deploy and monitor data jobs.
- You’re responsible for ensuring that critical data pipelines have minimal downtime.
- You have a hobby project that runs a pipeline you need to put on a schedule.
- You have workflows that don’t fit well into the DAG structure.
- You want to learn advanced patterns in workflow orchestration.
Prerequisites
- A computer with Prefect installed (it can be installed with pip, Python's package manager: pip install prefect)
- Intermediate-level Python (familiarity with decorators, context managers, and classes)
- Familiarity with SQL
- Experience using APIs
Recommended preparation:
- Watch “Files and Exceptions” and “Object-Oriented Programming” (lessons 9 and 10 in Python Fundamentals)
- Read “Data Loading, Storage, and File Formats” and “Data Cleaning and Preparation” (chapters 6 and 7 in Python for Data Analysis)
Recommended follow-up:
- Read Prefect Orion (supporting documentation)
- Explore dbt (supporting documentation)
- Read Dask (supporting documentation)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Negative engineering and workflow orchestration (30 minutes)
- Presentation: Overview of materials; negative engineering and how production data pipelines can fail; consequences of pipeline failures; common workflow patterns; the need for workflow orchestration
- Hands-on exercise: Native Python Work Example (see the defensive-retry sketch below)
- Group discussion: What can you use workflow orchestration for?
- Q&A
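For a taste of what this exercise covers, here is a minimal sketch of the hand-rolled defensive retry code that orchestration is meant to replace; the API URL is a hypothetical placeholder:

    # Hand-rolled retries and backoff: the defensive code engineers
    # end up writing by hand before adopting an orchestrator.
    import time

    import requests

    def fetch_records(url: str, max_retries: int = 3) -> list:
        for attempt in range(1, max_retries + 1):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                return response.json()
            except requests.RequestException:
                if attempt == max_retries:
                    raise  # out of retries; surface the failure
                time.sleep(2 ** attempt)  # exponential backoff before retrying

    records = fetch_records("https://api.example.com/records")  # hypothetical URL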
Prefect and basic orchestration features (35 minutes)
- Presentation: Retries, parameters, and timeouts; scheduling; task library; secrets; async execution
- Hands-on exercise: Putting Together a Simple Data Pipeline (see the Prefect sketch below)
- Q&A
- Break
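To preview this session, here is a minimal sketch of the same idea expressed with Prefect 2 (Orion) decorators; the task bodies and default URL are illustrative placeholders:

    from prefect import flow, task

    @task(retries=3, retry_delay_seconds=10)
    def extract(url: str) -> list:
        # In a real pipeline this would call an API;
        # Prefect retries the task automatically on failure.
        return [1, 2, 3]

    @task
    def load(rows: list) -> None:
        print(f"loaded {len(rows)} rows")

    @flow(timeout_seconds=300)  # fail the run if it exceeds five minutes
    def etl(url: str = "https://api.example.com/records"):
        load(extract(url))

    if __name__ == "__main__":
        etl()

The retries, the timeout, and the url parameter replace the hand-written defensive code from the first session.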
Docker and Python packaging (25 minutes)
- Presentation: The need for Docker; creating a Python package; uploading an image to a registry
- Hands-on exercise: Building a Simple Image (see the Dockerfile sketch below)
- Q&A
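A minimal, hypothetical Dockerfile of the kind built in this exercise; the file names are placeholders:

    FROM python:3.10-slim
    WORKDIR /app

    # Install dependencies first so Docker can cache this layer
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the flow code itself and run it by default
    COPY flow.py .
    CMD ["python", "flow.py"]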
Using distributed compute for parallel execution (20 minutes)
- Presentation: What Dask is; what makes Dask well suited to distributed compute; depth-first execution and mapping; using Dask to parallelize tasks
- Hands-on exercise: Running a Flow on Dask (see the sketch below)
- Q&A
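As a preview, here is a sketch of running a flow's tasks in parallel on Dask; it assumes the prefect-dask collection is installed (pip install prefect-dask), and the task itself is illustrative:

    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def double(x: int) -> int:
        return x * 2

    @flow(task_runner=DaskTaskRunner())  # starts a temporary local Dask cluster
    def parallel_flow(numbers: list[int]):
        # map submits one task run per element; Dask executes them in parallel
        return double.map(numbers)

    if __name__ == "__main__":
        parallel_flow([1, 2, 3, 4])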
Advanced patterns and subflows (30 minutes)
- Presentation: Typing and pydantic; the flow-of-flows orchestration pattern; breaking the DAG; the need for runtime flexibility; analytics on top of workflow history
- Hands-on exercise: Running a Flow Without Preregistered Components (see the flow-of-flows sketch below)
- Group discussion: What use cases need runtime flexibility?
- Q&A
- Break
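A sketch of the flow-of-flows pattern covered here: in Prefect 2, calling one @flow-decorated function from another creates a subflow run that the orchestrator tracks separately (the function names are illustrative):

    from prefect import flow

    @flow
    def extract() -> list[int]:
        return [1, 2, 3]

    @flow
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    @flow
    def parent():
        rows = extract()  # runs as a subflow
        load(rows)        # another subflow

    if __name__ == "__main__":
        parent()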
Putting a real pipeline together (20 minutes)
- Presentation: Introduction to ELT, Airbyte, dbt, and Snowflake
- Hands-on exercise: Constructing and Running an End-to-End Pipeline (sketched below)
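One hedged sketch of what such an end-to-end ELT flow can look like: trigger an Airbyte sync, then run dbt transformations in the warehouse. The Airbyte URL, connection ID, and dbt project directory are hypothetical placeholders:

    import subprocess

    import requests
    from prefect import flow, task

    @task(retries=2, retry_delay_seconds=30)
    def trigger_airbyte_sync(connection_id: str) -> None:
        # Airbyte's local API exposes a sync endpoint (URL is a placeholder)
        requests.post(
            "http://localhost:8000/api/v1/connections/sync",
            json={"connectionId": connection_id},
            timeout=30,
        ).raise_for_status()

    @task
    def run_dbt() -> None:
        # Shell out to the dbt CLI; assumes a configured dbt project
        subprocess.run(["dbt", "run"], check=True, cwd="my_dbt_project")

    @flow
    def elt_pipeline(connection_id: str):
        trigger_airbyte_sync(connection_id)
        run_dbt()  # transform the loaded data in the warehouse (e.g., Snowflake)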
Wrap-up and Q&A (10 minutes)
Your Instructors
Kalise Richmond
Kalise Richmond is a sales engineer at Prefect, the company behind the open source workflow orchestration framework of the same name. Previously, she was a software engineer at Apptio, where she helped build analytics tools. Throughout her career, Kalise has run product training sessions for both technical and nontechnical audiences.
Nathan Nowack
Nate Nowack is a solutions engineer at Prefect, where he focuses on building out cloud infrastructure and distributed workflow orchestration patterns to support client projects. He’s also a contributor to Airbyte.