Data Engineering

Getting Started with Workflow Orchestration

Published by O'Reilly Media, Inc.

Content level: Beginner

Build, run, and monitor data pipelines at scale

Data engineers and scientists spend much of their time on negative, or defensive, engineering: writing code to handle unpredictable failures such as resources going down, APIs failing intermittently, or malformed data corrupting pipelines. Workflow orchestration tools help eliminate negative engineering, allowing engineers and scientists to focus on the problems they are solving. Modern data applications have evolved, and orchestrators such as Prefect now provide more runtime flexibility and the ability to leverage distributed compute through Dask.

Join experts Kalise Richmond and Nate Nowack to discover how workflow orchestration can free you up to build solutions, not just avert failures. You’ll learn about basic orchestration features such as retries, scheduling, parameterization, caching, and secret management, and you’ll construct real data pipelines.
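To make the retry idea concrete before the course, here is a minimal pure-Python sketch of what an orchestrator provides as a built-in option. The decorator, function names, and URL are illustrative, not Prefect's actual API; in Prefect the same behavior is a task configuration rather than hand-written code.

```python
import time


def with_retries(retries=3, delay_seconds=1):
    """Re-run a function on failure -- a hand-rolled stand-in for the
    retry support orchestrators expose as a simple task option."""
    def decorate(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries: surface the failure
                    time.sleep(delay_seconds)
        return wrapper
    return decorate


calls = {"n": 0}  # counts real invocations of the flaky work


@with_retries(retries=3, delay_seconds=0)
def fetch_data(url):
    """Simulated flaky API call: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("intermittent failure")
    return f"payload from {url}"


result = fetch_data("https://example.com/api")  # succeeds on the third attempt
print(result)
```

Writing this boilerplate for every task is exactly the negative engineering the course describes; an orchestrator replaces it with declarative configuration.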

What you’ll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • The concept of negative engineering
  • What workflow orchestration is and the common features of these frameworks
  • Common workflows such as extract, transform, load (ETL) pipelines
  • How the modern data stack has evolved to require more from orchestrators
  • How orchestration frameworks provide more runtime flexibility by eliminating the directed acyclic graph (DAG) requirement
  • The role of distributed computing frameworks such as Dask in workflow orchestration
  • Advanced patterns, such as subflows
  • How you can combine cloud orchestration with on-premises execution

And you’ll be able to:

  • Deploy an open source orchestrator service in your own infrastructure
  • Frame data pipelines clearly before constructing them
  • Build a real data workflow that transforms and moves data from sources to sinks
  • Schedule the workflow on a regular basis
  • Add monitoring and respond to failure events
  • Leverage Dask to parallelize the workflow
  • Use caching and persistence of data to make future runs efficient
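As a preview of the last point, this sketch shows the caching concept with the standard library's `functools.lru_cache`. It is an in-process stand-in only: real orchestrators persist results across runs and let you define cache keys and expirations, which `lru_cache` does not.

```python
import functools

call_count = {"n": 0}  # tracks how often the expensive work actually runs


@functools.lru_cache(maxsize=None)
def transform(record_id: int) -> int:
    """An 'expensive' pipeline step; repeat calls are served from cache."""
    call_count["n"] += 1
    return record_id * 2


first = transform(21)   # computed
second = transform(21)  # cache hit; the work is not repeated
```

The same inputs produce the same result without recomputation, which is what makes re-running a partially failed pipeline cheap.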

This live event is for you because...

  • You want to become a data engineer.
  • You’re a data scientist or data engineer looking to deploy and monitor data jobs.
  • You’re responsible for ensuring that critical data pipelines have minimal downtime.
  • You have a hobby project that runs a pipeline you need to put on a schedule.
  • You have workflows that don’t fit well into the DAG structure.
  • You want to learn advanced patterns in workflow orchestration.

Prerequisites

  • A computer with Prefect installed (installable via pip, Python's package manager)
  • Intermediate-level Python (familiarity with decorators, context managers, and classes)
  • Familiarity with SQL
  • Experience using APIs

Recommended follow-up:

  • Read Prefect Orion (supporting documentation)
  • Explore dbt (supporting documentation)
  • Read Dask (supporting documentation)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Negative engineering and workflow orchestration (30 minutes)

  • Presentation: Overview of materials; negative engineering and how production data pipelines can fail; consequences of pipeline failures; common workflow patterns; the need for workflow orchestration
  • Hands-on exercise: Native Python Work Example
  • Group discussion: What can you use workflow orchestration for?
  • Q&A

Prefect and basic orchestration features (35 minutes)

  • Presentation: Retries, parameters, and timeouts; scheduling; task library; secrets; async execution
  • Hands-on exercise: Putting Together a Simple Data Pipeline
  • Q&A
  • Break

Docker and Python packaging (25 minutes)

  • Presentation: The need for Docker; how to create a Python package; uploading an image to a registry
  • Hands-on exercise: Building a simple image
  • Q&A

Using distributed compute for parallel execution (20 minutes)

  • Presentation: What Dask is and what makes it well suited to distributed compute; depth-first execution and mapping; using Dask to parallelize tasks
  • Hands-on exercise: Running a Flow on Dask
  • Q&A
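The mapping pattern in this segment can be previewed without a Dask cluster. This sketch uses the standard library's `concurrent.futures` as a local stand-in; the data and function names are illustrative. Dask's `delayed` and `client.map` apply the same pattern across distributed workers instead of local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(record: dict) -> dict:
    """Per-record transformation with no shared state, so each record
    can be processed independently and therefore in parallel."""
    return {**record, "name": record["name"].strip().lower()}


records = [{"name": "  Ada "}, {"name": "GRACE"}, {"name": " Edsger"}]

# Map the task over the inputs in parallel, local-thread edition.
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(clean, records))

print(cleaned)
```

Because `clean` touches no shared state, scaling it out is a scheduling decision, not a code rewrite; that independence is what makes a task a good candidate for Dask.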

Advanced patterns and subflows (30 minutes)

  • Presentation: Typing and pydantic; orchestration pattern with flow of flows; breaking the DAG; the need for runtime flexibility; analytics on top of workflow history
  • Hands-on exercise: Running a Flow without pre-registered components
  • Group discussion: What use cases need runtime flexibility?
  • Q&A
  • Break
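The flow-of-flows pattern above can be sketched in plain Python: a parent flow composes smaller flows by calling them. All names here are hypothetical; in an orchestrator each subflow additionally gets its own run history, state, and retry policy.

```python
def extract_flow(source: str) -> list[int]:
    """Subflow: pull raw rows from a named source (stubbed data)."""
    return [1, 2, 3] if source == "orders" else []


def transform_flow(rows: list[int]) -> list[int]:
    """Subflow: apply business logic to the extracted rows."""
    return [r * 10 for r in rows]


def parent_flow(source: str) -> list[int]:
    """Parent flow: composes subflows with ordinary control flow --
    no pre-declared DAG, so branching at runtime is just an if-statement."""
    rows = extract_flow(source)
    if not rows:
        return []  # runtime decision a static DAG cannot easily express
    return transform_flow(rows)


totals = parent_flow("orders")
print(totals)  # [10, 20, 30]
```

The `if not rows` branch is the point: because the flow is ordinary code, its shape can depend on data seen at runtime, which is what "breaking the DAG" refers to.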

Putting a real pipeline together (20 minutes)

  • Presentation: Introduction to ELT, Airbyte, dbt, and Snowflake
  • Hands-on exercise: Constructing and running an end-to-end pipeline
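To see the ELT shape in miniature, this sketch uses an in-memory SQLite database as a stand-in for a warehouse such as Snowflake; the table names and rows are invented. Raw data lands first (the role Airbyte plays), then SQL models it inside the warehouse (the role dbt plays).

```python
import sqlite3

# In-memory SQLite stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")

# Extract + Load: land the raw data as-is first (the "EL" of ELT).
raw = [(1, 9.99), (2, 24.50), (3, 5.00)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw)

# Transform: model the loaded data with SQL inside the warehouse.
conn.execute(
    "CREATE TABLE order_totals AS "
    "SELECT COUNT(*) AS n, SUM(amount) AS total FROM raw_orders"
)
n, total = conn.execute("SELECT n, total FROM order_totals").fetchone()
print(n, total)
```

Loading before transforming means the raw data is always available to re-model, which is the key difference from classic ETL.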

Wrap-up and Q&A (10 minutes)

Your Instructors

  • Kalise Richmond

    Kalise Richmond is a sales engineer at Prefect, an open source workflow orchestration management system. Previously, she was a software engineer at Apptio, where she helped build analytic tools. Throughout her career, Kalise has run product training sessions for both technical and nontechnical audiences.

  • Nathan Nowack

    Nate Nowack is a solutions engineer at Prefect, where he focuses on building out cloud infrastructure and distributed workflow orchestration patterns to support client projects. He’s also a contributor to Airbyte.