Designing Data Pipelines
Published by O'Reilly Media, Inc.
The nuts and bolts of designing stable, extensible, and scalable data pipelines
The data pipeline has become a fundamental component of data science, data analysis, and data engineering workflows. Pipelines serve as the glue that links together the various stages of data cleansing, data validation, and data transformation. However, despite their importance to the data ecosystem, constructing the optimal data pipeline is generally an afterthought, if it's considered at all. This makes any changes to the central pipeline highly error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that are scalable and that allow for monitoring is pivotal for the safe and continued use of data.
This session covers the core components that every data pipeline needs from an operational and functional perspective. We'll discuss a framework that allows practitioners to set their pipelines up for success, how to leverage data pipelines for metrics gathering, and how pipelines can be architected to alert on potential data problems before they reach downstream consumers.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- How to build and design data pipelines, as well as what to consider before writing any code
- A framework for designing data pipelines, complete with the set of components to create a stable and extensible data pipeline
- How to break down a data pipeline into its core components to allow for smaller and more targeted changes
- Common terminology in the data and data pipeline ecosystems
And you’ll be able to:
- Evaluate the stability and extensibility of your data pipelines, and proactively address shortcomings
- Author and deploy pipelines into Apache Airflow
- Set up a mechanism to test data pipelines before promoting them to production
This live event is for you because...
- You are a data practitioner who writes and manages data transformations and data pipelines
- You work with data processing systems and want to ensure that these systems are extensible and functional
- You want to use every tool at your disposal to ensure that the data process flow in your organization is optimally designed
Prerequisites
- Experience with data analysis and data transformation
- Basic experience with data computation
- Basic familiarity with orchestration tools such as cron or Airflow
Recommended preparation:
- Read Chapter 1 of Data Pipelines with Apache Airflow for an introduction to data pipelines
Recommended follow-up:
- Read Data Management at Scale (book)
- Finish reading Data Pipelines with Apache Airflow (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction (40 minutes)
- Presentation: Why This Topic?
- Common terminology in the pipeline ecosystem
- Basic introduction to Airflow
- Q&A
Building A Pipeline (75 minutes)
- (30 minutes) Exercise: Introduction to Airflow
- Set up Airflow and a pipeline authoring environment
- Build basic one-node and two-node DAGs
- Gain familiarity with running DAGs
- (10 minutes) Break
- (30 minutes) Exercise: Basic ETL with Airflow
- Build your first ETL DAG in Airflow, replicating manual tasks done locally
- Author a DAG in Airflow using Python and Airflow operators
- (5 minutes) Q&A
- (5 minutes) Break
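The extract → transform → load flow built in these exercises can be sketched as three plain Python callables. In the course itself each function would become one Airflow task (for example, wrapped in a PythonOperator), with the dependency chain extract >> transform >> load defining the DAG's edges. The function names and sample data below are illustrative assumptions, not part of the course materials.

```python
# A minimal ETL flow as plain Python functions. In Airflow, each function
# would be one task (node) in a DAG; the call chain at the bottom mirrors
# the dependency chain extract >> transform >> load.
# All names and sample records here are illustrative assumptions.

def extract():
    # Stand-in for pulling raw records from a source system.
    return [{"city": "nyc", "temp_f": 68}, {"city": "sf", "temp_f": 59}]

def transform(rows):
    # Normalize fields: uppercase city codes, convert Fahrenheit to Celsius.
    return [
        {"city": r["city"].upper(),
         "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in rows
    ]

def load(rows, target):
    # Stand-in for writing to a warehouse table; here, an in-memory list.
    target.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
```

Keeping each stage a small, single-purpose function is what makes the later step of breaking a pipeline into targeted, independently changeable components possible.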
Optimizing Your Pipeline (40 minutes)
- Presentation: “Basic ETL with Airflow Pitfalls”
- Exercise / Discussion
- Where can things go wrong in this pipeline?
- What can we do to safeguard against things going wrong?
- What metrics do we want to gather?
- How do we want to respond if/when something goes wrong?
- (30 minutes) Exercise: Implementing Best Practices on Basic ETL Pipeline
- Q&A
- (10 minutes) Break
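The safeguards discussed here (retrying transient failures, validating output, and gathering metrics for monitoring) can be sketched as a small wrapper around a task callable. This is an assumed illustration of the general technique, not code from the course; Airflow exposes comparable knobs natively (e.g. task retries and callbacks).

```python
# Illustrative safeguards for a pipeline task (assumed, not from the
# course materials): retry transient failures, validate the output row
# count, and record simple metrics that a monitor could alert on.
import time

def run_with_safeguards(task, metrics, retries=3, min_rows=1):
    """Run `task`, retrying on failure and validating its row count."""
    for attempt in range(1, retries + 1):
        try:
            rows = task()
            break
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure for alerting
            time.sleep(0)  # real backoff would go here; zero keeps the sketch fast
    if len(rows) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(rows)}")
    metrics["row_count"] = len(rows)  # gathered metric, e.g. for dashboards
    metrics["attempts"] = attempt     # gathered metric, e.g. for flakiness alerts
    return rows
```

The row-count floor is the simplest possible data-quality check; in practice the same hook is where schema, freshness, or distribution checks would run before bad data is promoted downstream.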
Looking Forward (10 minutes)
- Presentation: What tools exist in the ecosystem that we can use to optimize our pipelines?
- Q&A / Discussion: What tools have you seen used in the past for monitoring pipelines and ensuring that they are healthy?
Conclusion (5 minutes)
- Recap of the material covered
- Pointers to more resources online and elsewhere
- Final Q&A and Wrap Up
Your Instructor
Vinoo Ganesh
Vinoo Ganesh leads the deployed engineering team at Bluesky Data, a startup building the next generation of cloud data infrastructure. Prior to this role, Vinoo was Head of Business Engineering at Ashler Capital of the Citadel Investment Group, where he oversaw critical data pipelines and investment platforms. In the past, Vinoo worked as CTO of Veraset, a geospatial intelligence data-as-a-service startup (which processed over 2 TB of geospatial data), and led software engineering and forward-deployed engineering teams at Palantir Technologies. He is also an experienced startup advisor, advising Databand.ai's development of tools to solve data observability problems across the stack as well as Horangi's development of Warden, its best-in-class cybersecurity product.