Apache Airflow

Designing Data Pipelines—with Interactivity

Published by O'Reilly Media, Inc.

Content level: Intermediate

The nuts and bolts of designing stable, extensible, and scalable data pipelines

This live event utilizes interactive environments

The data pipeline has become a fundamental component of data science, data analysis, and data engineering workflows. Pipelines serve as the glue that links together the various stages of data cleansing, validation, and transformation. Yet despite their importance to the data ecosystem, the design of the pipeline itself is often an afterthought, if it's considered at all, which makes any change to a central pipeline error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that are scalable and that allow for monitoring is pivotal to the safe and continued use of data.

This session covers the core components every data pipeline needs from an operational and functional perspective. We'll walk through a framework that helps practitioners set their pipelines up for success, discuss how to leverage pipelines for metrics gathering, and show how pipelines can be architected to alert on potential data problems before they cause harm downstream.
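As a taste of the alerting patterns discussed in the session, here's a minimal sketch of one way an Airflow task can surface a data problem, assuming Airflow 2.4 or later; the notify callback and the row-count check are hypothetical placeholders for whatever alerting channel and validation logic you actually use.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify(context):
    # Hypothetical alert hook: forward the failed task's details to your
    # monitoring or chat system of choice.
    ti = context["task_instance"]
    print(f"ALERT: task {ti.task_id} failed in DAG {ti.dag_id}")


def validate_row_counts():
    # Placeholder data check; a real pipeline would compare counts against
    # an expected range and raise when the data looks wrong.
    row_count = 0
    if row_count == 0:
        raise ValueError("no rows ingested")


with DAG(
    dag_id="alerting_sketch",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="validate_row_counts",
        python_callable=validate_row_counts,
        on_failure_callback=notify,  # fires when the check raises
    )
```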

What you’ll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • How to build and design data pipelines, as well as what to consider before writing any code
  • A framework for designing data pipelines, complete with the set of components to create a stable and extensible data pipeline
  • How to break down a data pipeline into its core components to allow for smaller and more targeted changes
  • Common terminology in the data and data pipeline ecosystems

And you’ll be able to:

  • Evaluate the stability and extensibility of your data pipelines, and proactively address shortcomings
  • Author and deploy pipelines into Apache Airflow
  • Set up a mechanism to test data pipelines before promoting them to production (a minimal test sketch follows this list)
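As a hedged illustration of that last point, here's one common pre-production check, assuming pytest and a local Airflow 2.x install: verify that every DAG file in the repository imports cleanly before it's promoted.

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Loading the DagBag parses every DAG file; any syntax error or bad
    # import in a pipeline definition shows up in import_errors.
    dag_bag = DagBag(include_examples=False)
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"
```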

This live event is for you because...

  • You are a data practitioner who writes and manages data transformations and data pipelines
  • You work with data processing systems and want to ensure that these systems are extensible and functional
  • You want to use every tool at your disposal to ensure that your organization's data processing flows are optimally designed

Prerequisites

  • Experience with data analysis and data transformation
  • Basic experience with data computation
  • Basic familiarity with orchestration tools such as cron or Airflow


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction (40 minutes)

  • Presentation: Why This Topic?
  • Common terminology in the pipeline ecosystem
  • Basic introduction to Airflow
  • Q&A

Building A Pipeline (75 minutes)

  • (30 minutes) Exercise: Introduction to Airflow
  • Set up Airflow and a pipeline authoring environment
  • Build basic one-node and two-node DAGs
  • Gain familiarity with running DAGs
  • (10 minutes) Break
  • (30 minutes) Exercise: Basic ETL with Airflow
  • Build your first ETL DAG in Airflow, replicating manual tasks done locally
  • Author a DAG in Airflow using Python and Airflow operators (a minimal sketch follows this schedule block)
  • (5 minutes) Q&A
  • (5 minutes) Break
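For orientation, here's a minimal sketch of the kind of two-task ETL DAG built in these exercises, assuming a local Airflow 2.4+ environment; the extract and load callables are hypothetical placeholders for the manual steps being replicated.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source (file, API, database, ...).
    print("extracting raw data")


def load():
    # Placeholder: write the transformed data to its destination.
    print("loading transformed data")


with DAG(
    dag_id="basic_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator sets the dependency: extract runs before load.
    extract_task >> load_task
```

Dropping the second task and the dependency line gives the one-node version of the same DAG.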

Optimizing Your Pipeline (40 minutes)

  • Presentation: “Basic ETL with Airflow Pitfalls”
  • Exercise / Discussion
  • Where can things go wrong in this pipeline?
  • What can we do to safeguard against things going wrong?
  • What metrics do we want to gather?
  • How do we want to respond if/when something goes wrong?
  • (30 minutes) Exercise: Implementing Best Practices on Basic ETL Pipeline (a sketch of a few such safeguards follows this list)
  • Q&A
  • Break (10 minutes)
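To make the discussion concrete, here's a minimal sketch, assuming Airflow 2.4+, of a few safeguards of the kind this exercise covers applied to a single task: automatic retries, a bounded retry delay, and an execution timeout. The transform callable and the specific values are illustrative, not recommendations from the course.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder transformation step that might fail transiently.
    print("transforming data")


default_args = {
    "retries": 2,                                # retry transient failures twice
    "retry_delay": timedelta(minutes=5),         # wait between attempts
    "execution_timeout": timedelta(minutes=30),  # kill runaway tasks
}

with DAG(
    dag_id="hardened_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,  # applied to every task in the DAG
) as dag:
    PythonOperator(task_id="transform", python_callable=transform)
```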

Looking Forward (10 minutes)

  • Presentation: What tools exist in the ecosystem that we can use to optimize our pipelines?
  • Q&A / Discussion: What tools have you seen used in the past for monitoring pipelines and ensuring that they are healthy?

Conclusion (5 minutes)

  • Recap what was covered
  • Point to where more resources can be found online and elsewhere
  • Final Q&A and wrap-up

Your Instructor

  • Vinoo Ganesh

    Vinoo Ganesh leads the deployed engineering team at Bluesky Data, a startup building the next generation of cloud data infrastructure. Prior to this role, Vinoo was Head of Business Engineering at Ashler Capital of the Citadel Investment Group, where he oversaw critical data pipelines and investment platforms. Before that, Vinoo was CTO of Veraset, a geospatial intelligence data-as-a-service startup (which processed over 2 TB of geospatial data), and led software engineering and forward deployed engineering teams at Palantir Technologies. He is also an experienced startup advisor, advising Databand.ai's development of tools to solve data observability problems across the stack as well as Horangi's development of Warden, its best-in-class cybersecurity product.