Designing Data Pipelines
Published by O'Reilly Media, Inc.
The nuts and bolts of designing stable, extensible, and scalable data pipelines
The data pipeline has become a fundamental component of data science, data analysis, and data engineering workflows. Pipelines serve as the glue that links together the various stages of data cleansing, data validation, and data transformation. However, despite their importance to the data ecosystem, constructing the optimal data pipeline is generally an afterthought, if it's considered at all. This makes any changes to the central pipeline highly error-prone and cumbersome. With the ever-growing demand for new kinds of data, especially from external vendors, constructing pipelines that are scalable and that allow for monitoring is pivotal for the safe and continued use of data.
This session covers the core components that every data pipeline needs from an operational and functional perspective. We'll discuss a framework that allows practitioners to set their pipelines up for success, how to leverage data pipelines for metrics gathering, and how pipelines can be architected to alert on potential data problems before they reach downstream consumers.
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- How to build and design data pipelines, as well as what to consider before writing any code
- A framework for designing data pipelines, complete with the set of components to create a stable and extensible data pipeline
- How to break down a data pipeline into its core components to allow for smaller and more targeted changes
- Common terminology in the data and data pipeline ecosystems
And you’ll be able to:
- Evaluate the stability and extensibility of your data pipelines, and proactively address shortcomings
- Author and deploy pipelines into Apache Airflow
- Set up a mechanism to test data pipelines before promoting them to production
This live event is for you because...
- You are a data practitioner who writes and manages data transformations and data pipelines
- You work with data processing systems and want to ensure that these systems are extensible and functional
- You want to use every tool at your disposal to ensure that the data process flow in your organization is optimally designed
Prerequisites
- Experience with data analysis and data transformation
- Basic experience with data computation
- Basic familiarity with orchestration tools such as cron or Airflow
Recommended preparation:
- Read Chapter 1 of Data Pipelines with Apache Airflow for an introduction to data pipelines
Recommended follow-up:
- Read Data Management at Scale (book)
- Finish reading Data Pipelines with Apache Airflow (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction (40 minutes)
- Presentation: Why This Topic?
- Common terminology in the pipeline ecosystem
- Basic introduction to Airflow
- Q&A
Building A Pipeline (75 minutes)
- (30 minutes) Exercise: Introduction to Airflow
- Set up Airflow and a pipeline authoring environment
- Build basic one-node and two-node DAGs
- Gain familiarity with running DAGs
- (10 minutes) Break
- (30 minutes) Exercise: Basic ETL with Airflow
- Build your first ETL DAG in Airflow, replicating manual tasks done locally
- Author a DAG in Airflow using Python and Airflow operators
- (5 minutes) Q&A
- (5 minutes) Break
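The extract → transform → load flow built in these exercises can be sketched as three plain Python callables. In the course itself each function would become one Airflow task (for example, wrapped in a PythonOperator), with the dependency chain extract >> transform >> load defining the DAG's edges. The function names and sample data below are illustrative assumptions, not part of the course materials.

```python
# A minimal ETL flow as plain Python functions. In Airflow, each function
# would be one task (node) in a DAG; the call chain at the bottom mirrors
# the dependency chain extract >> transform >> load.
# All names and sample records here are illustrative assumptions.

def extract():
    # Stand-in for pulling raw records from a source system.
    return [{"city": "nyc", "temp_f": 68}, {"city": "sf", "temp_f": 59}]

def transform(rows):
    # Normalize fields: uppercase city codes, convert Fahrenheit to Celsius.
    return [
        {"city": r["city"].upper(),
         "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in rows
    ]

def load(rows, target):
    # Stand-in for writing to a warehouse table; here, an in-memory list.
    target.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
```

Keeping each stage a small, single-purpose function is what makes the later step of breaking a pipeline into targeted, independently changeable components possible.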
Optimizing Your Pipeline (40 minutes)
- Presentation: “Basic ETL with Airflow Pitfalls”
- Exercise / Discussion
- Where can things go wrong in this pipeline?
- What can we do to safeguard against things going wrong?
- What metrics do we want to gather?
- How do we want to respond if/when something goes wrong?
- (30 minutes) Exercise: Implementing Best Practices on Basic ETL Pipeline
- Q&A
- (10 minutes) Break
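The safeguards discussed here (retrying transient failures, validating output, and gathering metrics for monitoring) can be sketched as a small wrapper around a task callable. This is an assumed illustration of the general technique, not code from the course; Airflow exposes comparable knobs natively (e.g. task retries and callbacks).

```python
# Illustrative safeguards for a pipeline task (assumed, not from the
# course materials): retry transient failures, validate the output row
# count, and record simple metrics that a monitor could alert on.
import time

def run_with_safeguards(task, metrics, retries=3, min_rows=1):
    """Run `task`, retrying on failure and validating its row count."""
    for attempt in range(1, retries + 1):
        try:
            rows = task()
            break
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure for alerting
            time.sleep(0)  # real backoff would go here; zero keeps the sketch fast
    if len(rows) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(rows)}")
    metrics["row_count"] = len(rows)  # gathered metric, e.g. for dashboards
    metrics["attempts"] = attempt     # gathered metric, e.g. for flakiness alerts
    return rows
```

The row-count floor is the simplest possible data-quality check; in practice the same hook is where schema, freshness, or distribution checks would run before bad data is promoted downstream.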
Looking Forward (10 minutes)
- Presentation: What tools exist in the ecosystem that we can use to optimize our pipelines?
- Q&A / Discussion: What tools have you seen used in the past for monitoring pipelines and ensuring that they are healthy?
Conclusion (5 minutes)
- Recap of the material covered
- Pointers to more resources online and elsewhere
- Final Q&A and Wrap Up
Your Instructor
Vinoo Ganesh
Vinoo Ganesh leads the deployed engineering team at Bluesky Data, a startup building the next generation of cloud data infrastructure. Prior to this role, Vinoo was Head of Business Engineering at Ashler Capital of the Citadel Investment Group, where he oversaw critical data pipelines and investment platforms. In the past, Vinoo worked as CTO of Veraset, a geospatial intelligence data-as-a-service startup (which processed over 2 TB of geospatial data), and led software engineering and forward-deployed engineering teams at Palantir Technologies. He is also an experienced startup advisor, advising Databand.ai's development of tools to solve data observability problems across the stack as well as Horangi's development of Warden, its best-in-class cybersecurity product.