Data Superstream: Building Data Pipelines and Connectivity
Published by O'Reilly Media, Inc.
Data pipelines are the foundation for success in data analytics, so understanding how they work is of the utmost importance. Join us for four hours of expert-led sessions that will give you insight into how data is moved, processed, and transformed to support analytics and reporting needs. You'll also learn how to address common challenges like monitoring and managing broken pipelines, explore considerations for choosing and connecting open source frameworks, commercial products, and homegrown solutions, and more.
About the Data Superstream Series: This three-part Superstream series is designed to help your organization maximize the business impact of your data. Each day covers different topics, with unique sessions lasting no more than four hours. And they’re packed with insights from key innovators and the latest tools and technologies to help you stay ahead of it all.
What you’ll learn and how you can apply it
- Learn how to build, deploy, and run a fully functioning ETL pipeline with Airflow
- Discover how to build robust data pipelines at scale
- Understand challenges in managing and monitoring hundreds of thousands of pipelines—and get tips on automating them
- Explore approaches to historical data reprocessing and data lifecycle management
This live event is for you because...
- You're a data or software engineer or solution architect interested in learning about the latest trends in moving, processing, and transforming data.
- You want to learn how to address common challenges and improve the scalability and stability of your pipelines.
- You want to better understand the systems that you already use and learn how to take full advantage of their capabilities.
Prerequisites
- Come with your questions
- Have a pen and paper handy to capture notes, insights, and inspiration
Recommended follow-up:
- Read Data Pipelines Pocket Reference (book)
- Read Data Science on AWS (book)
- Read Data Quality Fundamentals (early release book)
- Read What Is Data Observability? (report)
- Explore Build a Robust Data Pipeline (four-part interactive scenario set)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Alistair Croll: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 3:00pm UTC/GMT
- Alistair Croll welcomes you to the Data Superstream.
Afua Bruce: Keynote—Maximizing Your Impact as a Data Scientist (10 minutes) - 8:05am PT | 11:05am ET | 3:05pm UTC/GMT
- As the demand for data skills rises, the traits of the successful data scientist continue to evolve. The world needs data scientists who possess a blend of technical and business skills and who can identify and implement responsible and inclusive data practices within their many spheres of influence. In her keynote address, Afua Bruce, author of The Tech That Comes Next, builds on her experience at New America, DataKind, and the FBI to look at the impact of the data systems we build.
- Afua Bruce is a leading public interest technologist who has spent her career working at the intersection of technology, policy, and society. She’s held senior science and technology positions at the White House, the FBI, IBM, and a couple of nonprofits. She’s currently a Technology and Public Purpose fellow at the Harvard Kennedy School. And as an If/Then Ambassador, Afua works to get girls excited about STEM careers; she has partnered with GoldieBlox, appeared on CBS’s Mission Unstoppable, and was among the 120 women in STEM honored with statues at the If/Then Exhibit on display at the Smithsonian. Her newest book, The Tech That Comes Next: How Changemakers, Technologists, and Philanthropists Can Build an Equitable World, describes how technology can advance equity.
Roksolana Diachuk: Modern Data Pipelines in AdTech—Life in the Trenches (30 minutes) - 8:15am PT | 11:15am ET | 3:15pm UTC/GMT
- The modern data pipelines approach helps solve a number of challenges in different domains, including advertising. Join Roksolana Diachuk to learn how to use modern data pipelines for reporting and analytics by examining a case study of historical data reprocessing in AdTech. You’ll explore the problem itself, the implementation, the challenges, and future improvements. You’ll also take an in-depth look at approaches to historical data reprocessing and data lifecycle management, which are particularly useful when business rules change or errors are found in past data and you need to reprocess your historical data (a task that requires a lot of time, precision, and computational resources). An illustrative sketch of the reprocessing pattern follows this session listing.
- Roksolana Diachuk is a big data engineer at Captify who’s passionate about big data, Scala, and Kubernetes. Roksolana speaks at technical conferences and meetups in Europe and the US and is one of the leads of Women Who Code Kyiv. Her hobbies include building technical topics around fairytales and discovering new cities.
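As a rough illustration of the reprocessing pattern described above (this is not code from the talk), one common approach is to re-run the current business rules over date-partitioned historical data and write the results to a new output version, so the corrected data can be validated before it replaces the old. The PySpark sketch below uses invented paths, column names, and rules.

```python
# Minimal sketch of historical reprocessing: re-run today's business rules over
# old date partitions and write to a new, versioned output location. Paths,
# columns, and the rule itself are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reprocess-historical").getOrCreate()

def apply_current_rules(df):
    # The updated business rule, applied uniformly to every partition being reprocessed.
    return df.withColumn(
        "billable_impressions",
        F.when(F.col("event_type") == "viewable_impression", F.col("impressions")).otherwise(0),
    )

# Reprocess a bounded date range rather than the full history at once,
# so compute cost and blast radius stay manageable.
dates_to_reprocess = ["2021-11-01", "2021-11-02", "2021-11-03"]

for d in dates_to_reprocess:
    raw = spark.read.parquet(f"s3://ads-raw/events/date={d}")
    fixed = apply_current_rules(raw)
    # Write to a versioned location; the corrected output can be compared
    # against the old version before any downstream switch-over.
    fixed.write.mode("overwrite").parquet(f"s3://ads-reports/v2/date={d}")
```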
Michael Galarnyk: Bridging the Gap Between Data Pipelines and Machine Learning with MLOps (Sponsored by Intel) (30 minutes) - 8:45am PT | 11:45am ET | 3:45pm UTC/GMT
- Even if your data pipelines aren’t broken, it’s still important to monitor them. This is especially true if you want to utilize your data for machine learning. But making a data pipeline fit for machine learning use cases requires more than just additional data monitoring. Bringing machine learning into production has traditionally required a lot of manual setup and configuration even for toy ML pipelines; unfortunately, these manual methods aren’t reproducible, don’t autoscale, require significant technical expertise, and are prone to error. Join Michael Galarnyk to go deeper into data pipeline challenges in machine learning use cases. You’ll explore open source and commercial products for monitoring and managing ML pipelines in production; get an overview of MLOps and how it can be implemented to streamline your end-to-end machine learning workflow; and learn how to automate the AI model lifecycle with ready-to-use pipelines and low-code solutions. Plus, you’ll get example ML use cases you can use right away to achieve better results and extract more insights from your data.
- Michael Galarnyk works on low-code AI Blueprints at cnvrg.io. In his spare time, he teaches classes on Python-based machine learning through Stanford Continuing Studies and LinkedIn Learning. You can find him on Twitter, Medium, and GitHub.
- This session will be followed by a 30-minute Q&A in a breakout room. Stop by if you have more questions for Michael.
- Break (5 minutes)
Vinoo Ganesh: Zero to Pipeline (30 minutes) - 9:20am PT | 12:20pm ET | 4:20pm UTC/GMT
- There are few moments more daunting for a data practitioner than deploying their first data pipeline. The flexibility, freedom, and development speed of the data pipeline ecosystem allow for endless tuning, customization, and configuration, but they also make getting started overwhelming and difficult. In this live coding session, Vinoo Ganesh takes you through scoping, building, deploying, and running a fully functioning ETL pipeline in Airflow in just 30 minutes, all in a local developer environment. You’ll also learn how to simplify each step of the ETL process into a task in a job execution DAG (a minimal sketch of that pattern follows this session listing). Join in to get the tools and knowledge to stand up your own pipeline developer environment at home.
- Vinoo Ganesh leads the deployed engineering team at Bluesky Data, a startup building the next generation of cloud data infrastructure. Previously, he was head of business engineering at Ashler Capital, a Citadel Investment Group business, where he oversaw critical data pipelines and investment platforms; was CTO of Veraset, a geospatial intelligence data-as-a-service startup (which processed over 2 TB of geospatial data); and led software engineering and forward-deployed engineering teams at Palantir. He’s an experienced startup advisor who has guided Databand.ai in developing tools to solve data observability problems across the stack and advised Horangi on Warden, its best-in-class cybersecurity product.
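To make the DAG idea above a bit more concrete, here is a minimal sketch, not the session's actual code: each ETL step (extract, transform, load) becomes its own task in an Airflow DAG, wired together with the TaskFlow API. The sample rows, output path, and schedule are invented placeholders, and it assumes a recent Airflow 2.x installation.

```python
# Minimal sketch: one task per ETL step in an Airflow 2.x DAG (TaskFlow API).
# Sample data, output path, and schedule are illustrative placeholders.
import json

import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2022, 1, 1, tz="UTC"), catchup=False)
def simple_etl():
    @task
    def extract() -> list[dict]:
        # Stand-in for pulling rows from an API, a database, or object storage.
        return [
            {"id": 1, "status": "completed", "amount": "19.99"},
            {"id": 2, "status": "cancelled", "amount": "5.00"},
            {"id": 3, "status": "completed", "amount": "42.50"},
        ]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Keep completed orders and normalize the amount into a numeric field.
        return [
            {"order_id": r["id"], "amount_usd": round(float(r["amount"]), 2)}
            for r in rows
            if r["status"] == "completed"
        ]

    @task
    def load(rows: list[dict]) -> None:
        # Stand-in for a warehouse load (e.g., COPY into a reporting table).
        with open("/tmp/clean_orders.json", "w") as f:
            json.dump(rows, f)

    # Chaining the calls gives Airflow the dependencies: extract -> transform -> load.
    load(transform(extract()))


simple_etl()
```

Dropped into Airflow's dags/ folder in a local environment, a file like this appears in the UI and can be triggered by hand, which is roughly the kind of local developer setup the session works in.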
Holly Smith: 5 Mistakes Data Engineers Make When Building Their Pipelines (30 minutes) - 9:50am PT | 12:50pm ET | 4:50pm UTC/GMT
- At Databricks, Holly Smith works on the toughest data problems her customers face. Over the past few years she’s seen how seemingly small bugs or an overlooked configuration can mutate into horrific monstrosities that require armies of data engineers to extract their far-reaching, disgusting tentacles. Don’t be the person or team responsible for letting these monstrosities grow. Join Holly to dive into the worst examples the industry has to offer as you explore the top five mistakes data engineers make when building their pipelines. It doesn’t matter if you’re a practitioner or a decision maker—if your day job is data, this session is for you.
- Holly Smith is a multi-award-winning data and AI expert with over a decade of experience working with data and AI teams in a variety of capacities, from individual contributors all the way up to leadership. She’s spent the last three years at Databricks working with multinational companies as they embark on their journey to the cutting edge of data. She also advises DataKind UK and Tech Talent Charter on data strategy and bringing data skills to nonprofits.
- Break (5 minutes)
Karen Li: Toward Real-Time Data Pipelines (Sponsored by Intel) (30 minutes) - 10:25am PT | 1:25pm ET | 5:25pm UTC/GMT
- Data is most valuable as soon as it’s generated. Data-driven organizations recognize this and are increasingly using streaming data in user-facing and operational analytics. Typical real-time use cases include risk operations, security analytics, logistics tracking, and real-time personalization. But with the move toward real-time analytics comes the need for pipelines capable of delivering real-time data to various applications. Karen Li shares principles and best practices for building real-time data pipelines. Join in to learn why flexible schemas, support for complex queries, and the ability to handle bursts in traffic and out-of-order events are much more important in real-time analytics than in traditional batch analytics. (A small sketch of out-of-order event handling follows this session listing.)
- Karen Li is a software engineer on the systems team at Rockset, which is responsible for the company’s distributed SQL query engine. In her time at Rockset, she’s implemented SQL-based rollups, optimized distributed aggregations, and debugged gnarly production issues. She joined Rockset after graduating from UCLA with a bachelor's in computer science.
- This session will be followed by a 30-minute Q&A in a breakout room. Stop by if you have more questions for Karen.
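As a small, self-contained illustration of the out-of-order point above (not material from the session, and not tied to Rockset's engine), the sketch below aggregates events into tumbling windows by event time and only finalizes a window once a watermark passes its end, so moderately late events still count toward the right window. The window size, allowed lateness, and event stream are all invented.

```python
# Minimal sketch: event-time tumbling windows with a watermark, so out-of-order
# events still land in the window where they occurred. Constants and the event
# stream are illustrative choices.
from collections import defaultdict

WINDOW_SECONDS = 60         # tumbling window length, keyed by event time
ALLOWED_LATENESS = 30       # watermark trails the max event time seen by this much

windows = defaultdict(int)  # open windows: window start -> event count
finalized = {}              # emitted windows: window start -> final count
max_event_time = 0

def process(event_time):
    """Count one event by when it occurred, not when it arrived."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS

    start = event_time - (event_time % WINDOW_SECONDS)
    if start + WINDOW_SECONDS <= watermark:
        # Too late: this window was already finalized. A real pipeline would
        # send the event to a correction path or dead-letter store instead.
        return
    windows[start] += 1

    # Emit every open window whose end the watermark has now passed.
    for s in sorted(windows):
        if s + WINDOW_SECONDS <= watermark:
            finalized[s] = windows.pop(s)

# The 100s event arrives out of order (after the 130s event) but is still
# credited to the [60, 120) window because the watermark has not passed it yet.
for t in [65, 70, 130, 100, 190]:
    process(t)

print("finalized:", finalized)        # finalized: {60: 3}
print("still open:", dict(windows))   # still open: {120: 1, 180: 1}
```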
Jun He: Building Robust Data Pipelines at Scale (30 minutes) - 10:55am PT | 1:55pm ET | 5:55pm UTC/GMT
- Data/ML pipelines have become central assets for businesses. As big data and ML grow more impactful, the scalability and stability of the ecosystem have become more important for both data scientists and the company at large. It’s now crucial to support pipelines for use cases that go beyond recommendations, predictions, and data transformations. Jun He shares his experience building and operating a workflow platform used to build robust data pipelines at scale. You’ll learn about the challenges he faced managing and monitoring hundreds of thousands of pipelines and the lessons he learned automating the system. You’ll also get best practices for workflow lifecycle management, along with insights into the platform’s design philosophy.
- Jun He is a senior software engineer on the big data orchestration team at Netflix, where he leads the effort to build the big data workflow scheduler that manages and automates the company’s ML and data pipelines. He’s worked in distributed systems and infrastructure for the majority of his career. Previously, he spent a few years building distributed services and search infrastructure at Airbnb, where he was the main contributor for its message bus and search pipeline.
Alistair Croll: Closing Remarks (5 minutes) - 11:25am PT | 2:25pm ET | 6:25pm UTC/GMT
- Alistair Croll closes out today’s event.
Your Host
Alistair Croll
Alistair Croll is an entrepreneur, author, and conference organizer. He's written four books on technology and society, including the best-selling Lean Analytics, which has been translated into eight languages. He's the cofounder of web performance startup Coradiant (acquired by BMC), the Year One Labs startup accelerator, and a number of other early-stage companies.
A prolific speaker, Alistair was a visiting executive at Harvard Business School, where he helped create a course on data science and critical thinking. He's founded and chaired a number of the world's leading technology events, including Cloud Connect, Strata, Startupfest, Scaletech, and the FWD50 Digital Government conference. He's currently working on Just Evil Enough, the subversive marketing playbook. Alistair lives in Montreal, Canada, and writes at acroll.substack.com.