Fixing Data Quality at Scale with Data Observability
Published by O'Reilly Media, Inc.
How to apply observability to your data pipelines
Do your product dashboards look funky? Are your quarterly reports way off? Are you sick and tired of running a SQL query only to discover that the dataset you’re using is broken or just plain wrong? These errors are costly and affect almost every team, yet they’re typically addressed only reactively and on an ad hoc basis.
As companies increasingly rely on data to run operations and drive decision making, you need to ensure that your data pipelines are consistently healthy and reliable. Just as software developers tackle application downtime, data professionals face their own availability challenge: data downtime, the periods when your data is partial, erroneous, missing, or otherwise inaccurate. To identify and eliminate data downtime, teams must leverage the five pillars of data observability and embrace automated checks to monitor pipeline performance.
Join experts Barr Moses and Ryan Kearns to learn how to minimize data downtime and gain observability into your data ecosystem. You’ll explore the concept of data downtime and see how to measure it to determine the quality and health of your data using SQL, a sample data table, and a Jupyter notebook. From there, you’ll apply software engineering principles of observability to your data through five key pillars of data health—volume, schema, lineage, freshness, and distribution—as you set service-level objectives for data observability in your data table and implement basic data observability checks. You’ll end by creating your very own anomaly detection algorithm that will help capture data downtime incidents in your data table.
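To give a flavor of the checks you’ll build, here is a minimal freshness check in SQL. It is a sketch only: the EXOPLANETS table and its DATE_ADDED column are illustrative assumptions rather than the course’s exact schema, and the date math uses SQLite functions, so adapt it to your warehouse.

```sql
-- Freshness: measure the gap between consecutive arrivals of new data.
-- An unusually long gap suggests the table has stopped updating.
WITH UPDATES AS (
    SELECT DISTINCT DATE_ADDED        -- days on which new rows landed
    FROM EXOPLANETS                   -- hypothetical sample table
)
SELECT
    DATE_ADDED,
    JULIANDAY(DATE_ADDED)
      - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
      AS DAYS_SINCE_LAST_UPDATE       -- NULL for the first arrival
FROM UPDATES
ORDER BY DATE_ADDED;
```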
What you’ll learn and how you can apply it
By the end of this live online course, you’ll understand:
- What data downtime is and how to measure it
- How to determine the quality of your data
- The five pillars of data observability
- How to set SLOs for data observability
- Basic data observability checks
- Best practices for eliminating data downtime
And you’ll be able to:
- Apply best practices from DevOps to data analytics and data engineering
- Write SQL scripts that accomplish basic data observability checks
- Identify broken data pipelines
- Perform basic data lineage searches (see the sketch after this list)
- Set alerts for data quality issues
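As a taste of the lineage searches listed above, here is one minimal approach: store lineage as edges and walk them with a recursive query. The LINEAGE table of (UPSTREAM_TABLE, DOWNSTREAM_TABLE) pairs is a hypothetical construct for this sketch, not something provided by the course.

```sql
-- Downstream impact analysis: which tables ultimately depend on a
-- broken source? UNION (not UNION ALL) deduplicates rows, which also
-- keeps the recursion from looping if the lineage graph has a cycle.
WITH RECURSIVE IMPACTED AS (
    SELECT DOWNSTREAM_TABLE
    FROM LINEAGE                          -- hypothetical edge table
    WHERE UPSTREAM_TABLE = 'EXOPLANETS'   -- the table you know is broken
    UNION
    SELECT L.DOWNSTREAM_TABLE
    FROM LINEAGE L
    JOIN IMPACTED I ON L.UPSTREAM_TABLE = I.DOWNSTREAM_TABLE
)
SELECT DOWNSTREAM_TABLE FROM IMPACTED;
```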
This live event is for you because...
- You’re a data professional who depends on reliable, accurate data to generate rich analytics and won’t settle for anything less.
- You have a love-hate relationship with SQL and are constantly on the lookout for query hacks.
- You believe that data downtime doesn’t receive the diligence it deserves.
- You want to learn new ways to fold observability best practices into your data management routine.
Prerequisites
- A basic understanding of SQL
- Familiarity with common data warehouse technologies and the principles of DevOps observability
Recommended preparation:
- Read “The Rise of Data Downtime” (article)
- Read “What Is Data Observability?” (article)
- Read “Good Pipelines, Bad Data” (article)
- Take SQL Fundamentals for Data (live online training course with Thomas Nield)
Recommended follow-up:
- Read Cloud Native Data Center Networking (book)
- Read The Modern Data Warehouse in Azure (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introducing data downtime (40 minutes)
- Presentation: Walk-through of a data downtime incident; defining data downtime; what data downtime looks like under the hood; measuring data downtime
- Group discussion: Have you encountered data downtime in your pipelines or analytics?; How much time do you spend on data downtime incidents?
- Jupyter Notebook exercise: Find the data issues in a dataset; measure data downtime
- Q&A
Break (5 minutes)
Introducing data observability (40 minutes)
- Presentation: Traditional methods of data quality monitoring—row counts and ad hoc queries; additional important measurements; applying best practices from software engineering and DevOps observability to data—SLOs, SLAs, and monitoring, alerting, and triaging; the five pillars of data observability—volume, schema, freshness, lineage, and distribution
- Jupyter Notebook exercise: Identify the five pillars of data observability in your dataset (a sample volume check for one pillar follows this section)
- Q&A
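For the volume pillar, a first check can be as simple as the sketch below: count the rows that land each day and flag suspiciously quiet days. As before, the EXOPLANETS table and the threshold of 50 rows are illustrative assumptions, not values from the course materials.

```sql
-- Volume: a day with zero or far-below-normal new rows is a classic
-- symptom of a broken pipeline upstream.
SELECT
    DATE_ADDED,
    COUNT(*) AS ROWS_ADDED
FROM EXOPLANETS              -- hypothetical sample table
GROUP BY DATE_ADDED
HAVING COUNT(*) < 50         -- illustrative "too quiet" threshold
ORDER BY DATE_ADDED;
```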
Break (5 minutes)
Detecting data anomalies (40 minutes)
- Presentation: What is anomaly detection?; What are data anomalies, and how do you find them? (manual approaches, how AI can help); signs you have anomalous data
- Jupyter Notebook exercise: Create an anomaly detection algorithm (for data volume or freshness; a minimal sketch follows this section)
- Q&A
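A common starting point for such an algorithm is a rolling mean and standard deviation (three-sigma) rule over daily volumes. The sketch below is PostgreSQL-flavored, since SQLite lacks a built-in standard-deviation aggregate, and the table name, the 13-day trailing window, and the 3x multiplier are all assumptions chosen for illustration.

```sql
-- Flag days whose row count deviates from the trailing two-week
-- average by more than three trailing standard deviations.
WITH DAILY_VOLUME AS (
    SELECT DATE_ADDED, COUNT(*) AS ROWS_ADDED
    FROM EXOPLANETS                       -- hypothetical sample table
    GROUP BY DATE_ADDED
),
STATS AS (
    SELECT
        DATE_ADDED,
        ROWS_ADDED,
        AVG(ROWS_ADDED)         OVER W AS TRAILING_AVG,
        STDDEV_SAMP(ROWS_ADDED) OVER W AS TRAILING_STD
    FROM DAILY_VOLUME
    WINDOW W AS (ORDER BY DATE_ADDED
                 ROWS BETWEEN 13 PRECEDING AND 1 PRECEDING)
)
SELECT DATE_ADDED, ROWS_ADDED, TRAILING_AVG
FROM STATS
WHERE TRAILING_STD > 0                    -- skip days with no baseline
  AND ABS(ROWS_ADDED - TRAILING_AVG) > 3 * TRAILING_STD;
```

Widening the window or raising the multiplier trades false alarms against missed incidents, a tuning decision worth experimenting with on your own data.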
Break (5 minutes)
Eliminating data downtime (35 minutes)
- Presentation: Data observability principles to help you eliminate data downtime
- Jupyter Notebook exercise: Use your anomaly detection algorithm on your dataset; consider a few approaches to ensure long-term data observability (one such approach is sketched below)
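One way to make such a check durable, sketched here under the same hypothetical EXOPLANETS and SQLite assumptions as earlier, is to persist it as a view that encodes a service-level objective; any scheduled job that finds rows in the view can then raise an alert.

```sql
-- SLO: new data must land at least daily. Any row returned by this
-- view is a freshness violation that a scheduler can alert on.
CREATE VIEW FRESHNESS_SLO_VIOLATIONS AS
WITH UPDATES AS (
    SELECT DISTINCT DATE_ADDED
    FROM EXOPLANETS                       -- hypothetical sample table
),
GAPS AS (
    SELECT
        DATE_ADDED,
        JULIANDAY(DATE_ADDED)
          - JULIANDAY(LAG(DATE_ADDED) OVER (ORDER BY DATE_ADDED))
          AS GAP_DAYS
    FROM UPDATES
)
SELECT DATE_ADDED, GAP_DAYS
FROM GAPS
WHERE GAP_DAYS > 1;                       -- illustrative daily-arrival SLO
```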
Wrap-up and Q&A (10 minutes)
Your Instructors
Barr Moses
Barr Moses is cofounder and CEO of Monte Carlo, a data reliability company backed by Accel and other top Silicon Valley investors. Previously, she was VP of customer operations at customer success company Gainsight, where she helped scale the company 10x in revenue and, among other functions, built the data and analytics team; a management consultant at Bain & Company; and a research assistant in the Statistics Department at Stanford. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr holds a BSc in mathematical and computational science from Stanford.
Ryan Kearns
Ryan Kearns is a founding data scientist at Monte Carlo, where he develops machine learning algorithms for the company’s data observability platform. Together with CEO and cofounder Barr Moses, he taught the first course on data observability for O'Reilly—the first tutorial on the subject using out-of-the-box SQL. He received bachelor’s degrees in computer science and philosophy (with honors) from Stanford University.