Data Superstream: Becoming a Data Engineer
Published by O'Reilly Media, Inc.
Building a career in a rapidly evolving discipline
This Superstream brings together an amazing slate of data industry experts and O’Reilly authors to enrich your professional journey with personal experiences, real-world insights, critical skills, and best practices. Whether you’re an experienced engineer looking to expand your knowledge of tools, technologies, and techniques, or you’re just considering the leap into a new career, you’ll come away from this conference with valuable new ideas and perspectives.
You’ll hear from Andy Petrella (author of Fundamentals of Data Observability) on the critical role of data engineers in maintaining data integrity to prevent misinformation and manipulation in the age of AI; Eevamaija Virtanen (cofounder of Helsinki Data Week) on getting a first job and building a career; Adi Polak (author of Scaling Machine Learning with Spark) on stream processing patterns and open source software; Dunith Danushka (senior developer advocate at Redpanda Data) on building composable data platforms; Colleen Tartow (field CTO at VAST Data) on her career journey from astrophysics to engineering leadership; Xinran Waibel (founder of the Data Engineer Things Community) on the path to becoming a senior engineer; Holden Karau (author of five O’Reilly books) on concrete applications of data engineering to healthcare; and Jowanza Joseph (author of Mastering Apache Pulsar) on applying generative AI to data engineering problems.
About the Data Superstream Series: This two-part Superstream series is designed to help your organization maximize the business impact of your data. Each day covers different topics, with unique sessions lasting no more than four hours. And they’re packed with insights from key innovators and the latest tools and technologies to help you stay ahead of it all.
What you’ll learn and how you can apply it
- Discover the key skills data engineers use to design, build, and maintain the infrastructure necessary for data generation, storage, and processing
- Learn why data quality and data governance are necessary to ensure that data is clean and compliant
- Get tips on building your career amidst the rapid changes brought on by generative AI
This live event is for you because...
- You’re interested in transitioning to a career in data engineering.
- You’re a data professional who wants to discover skills gaps and upskill accordingly to move to the senior or staff level.
- You want to effectively approach the data lifecycle from ingestion to labeling to solving problems with machine learning.
- You want to better understand what work matters the most at every stage of your career and learn how to build the skills you need to support your journey.
Prerequisites
- Come with your questions
- Have a pen and paper handy to capture notes, insights, and inspiration
Recommended follow-up:
- Read Fundamentals of Data Engineering (book)
- Watch Introduction to the Fundamentals of Data Engineering (on-demand course)
- Read Designing Data-Intensive Applications (book)
Schedule
The time frames are only estimates and may vary according to how the event is progressing.
Matt Housley: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 3:00pm UTC/GMT
- Matt Housley welcomes you to the Data Superstream.
Andy Petrella–Keynote: The Vital Role of Data Engineering in the Age of Generative AI (15 minutes) - 8:05am PT | 11:05am ET | 3:05pm UTC/GMT
- While data engineering is frequently viewed as a backend process focused on merely moving data, Andy Petrella argues that it’s the backbone of today’s AI revolution. As the technology becomes more integrated into everyday life, data engineers face increasing responsibility to uphold data quality and governance, feeding AI systems with high-quality, reliable, and timely data needed to avoid AI hallucinations and to ensure the technology delivers meaningful value. He also addresses the broader implications of generative AI, emphasizing the importance of data integrity in preventing misinformation and manipulation. Join him to explore how essential data engineers are in shaping the future of AI.
- Andy Petrella is a thought leader, author, and entrepreneur in data and analytics. As the founder and CEO of Kensu, Andy has pioneered innovative approaches to data observability, helping organizations ensure the reliability and trustworthiness of their data pipelines. With a strong background in mathematics and software engineering, Andy has been instrumental in advancing the field of data management, particularly in addressing the challenges of data literacy, quality, and governance in complex environments.
Eevamaija Virtanen: Becoming a Data Engineer (30 minutes) - 8:20am PT | 11:20am ET | 3:20pm UTC/GMT
- Eevamaija Virtanen shares her journey into data engineering and provides authentic, real-world insights into creating your own path in the data field, from getting your first job to mastering essential technical skills. Whether you’re just starting out or refining your expertise, this down-to-earth talk reveals what it really takes to succeed and offers actionable steps and relatable advice to help you on your data engineering path.
- Eevamaija Virtanen is a senior data engineer and partner at Invinite, founder of the DataTribe community, and cofounder of Helsinki Data Week. With a diverse background spanning data engineering, project management, business development, and photography, she brings an innovative approach to solving complex data and business challenges. Eevamaija is a passionate advocate for continuous learning, cross-pollination, and collaboration, making her a key voice in the Nordic data community.
- Break (5 minutes)
Adi Polak: Stream All Things—Patterns of Data Stream Processing (30 minutes) - 8:55am PT | 11:55am ET | 3:55pm UTC/GMT
- The industry has had more than 10 years of attempts to solve the data streaming problem. Nevertheless, 80% of time spent in every project is devoted to optimizing the streaming data and analyzing windows. You want a service that is reliable, can handle all kinds of data and connect with all kinds of systems, and is easy to manage and scale as systems grow. And it should be super low latency too. Is that too much to ask? Adi Polak discusses the basic challenges of data streaming, introduces a few design and architecture patterns that can help, and explores how to implement them using Apache Flink. While there is no silver bullet for the data streaming problem, Adi shares some pragmatic solutions that have helped many organizations build fast, scalable, and manageable data streaming pipelines.
- Adi Polak is an experienced software engineer, people manager, and author of Scaling Machine Learning with Spark. For most of her professional life, she has dealt with data and machine learning for operations and analytics, developing algorithms to solve real-world problems using ML techniques and leveraging expertise in Apache Spark, Kafka, HDFS, and distributed large-scale systems. Adi has taught Spark to thousands of students and has recently begun a new adventure with data streaming—specifically Flink and ML inference—and is hooked.
Dunith Danushka: Toward a Composable Data Platform (Sponsored by Redpanda) (30 minutes) - 9:25am PT | 12:25pm ET | 4:25pm UTC/GMT
- Dunith Danushka explains what composable data platform architecture is, how this innovative approach can transform your data infrastructure, and why standardization is crucial for both technical efficiency and business agility. By adopting open standards, engineers can reduce vendor coupling and create more flexible, future-proof data platforms. You’ll explore standards like the Kafka protocol for streaming data, PostgreSQL wire protocol for OLTP and streaming databases, Streaming SQL for real-time ETL, dbt for batch ELT, Apache Iceberg as a table format, and Amazon S3 for static data storage. You’ll see real-world examples of these standards in action and learn how to transform a conventional data platform into a composable architecture.
- Dunith Danushka is senior developer advocate at Redpanda, where he spends most of his time educating developers on how to build event-driven applications. He has a passion for designing, building, and operating large-scale, real-time event-driven architectures and enjoys sharing his knowledge through blogging, videos, and public speaking.
- This session will be followed by a 30-minute Q&A in a breakout room. Stop by if you have more questions for Dunith.
- Break (5 minutes)
Colleen Tartow: Transitioning to a Career in Data Engineering (30 minutes) - 10:00am PT | 1:00pm ET | 5:00pm UTC/GMT
- There are many valid and nonstandard paths to becoming a data engineer, including academia, consulting, software engineering, and more. Colleen Tartow discusses the career journey as a compilation of emerging skills and explores a nontraditional path to data leadership.
- Colleen Tartow is field CTO and head of strategy at VAST Data. She’s been obsessed with data her entire life and has over 20 years of experience in data, advanced analytics, engineering, and consulting. Her work on data, engineering, analytics, and diversity issues has led to her speaking and mentoring in a variety of venues. Colleen holds a PhD in astrophysics.
Xinran Waibel: Path to Senior Data Engineer (30 minutes) - 10:30am PT | 1:30pm ET | 5:30pm UTC/GMT
- Are you a data engineer who wants to level up in your career? Drawing on lessons learned from her own career journey, Xinran Waibel describes the technical and soft skills you need to transition from junior or mid-level data engineer to a senior or staff role. Learn how to capture growth opportunities in your current role and explore strategies for growth beyond the workplace, including continuous learning and networking, to help you build a strong portfolio.
- Xinran Waibel is the founder of Data Engineer Things, a global community dedicated to creating and sharing learning resources for data engineering. She builds data applications to power ML algorithms and product innovation on the personalization data engineering team at Netflix. Previously, she was a data engineer at Confluent and Target, where she leveraged big data technologies to enable data-driven decision-making in the marketing and membership space.
- Break (5 minutes)
Holden Karau: Fighting Health Insurance with AI—E2E Model Training to Deployment (30 minutes) - 11:05am PT | 2:05pm ET | 6:05pm UTC/GMT
- If you’ve ever had a health insurance claim denied, Holden Karau knows how you feel and has done something about it. She and others fine-tuned a model to generate health insurance appeals. Learn about her adventures using various cloud resources for fine-tuning and, ultimately, deploying on-premises Kubernetes, including the unexpected challenge of fitting graphics cards into servers.
- Holden Karau is a transgender Canadian open source developer focusing on Apache Spark and related big data tools. She has worked at Amazon, Apple, Google, Databricks, and IBM. She’s the coauthor of Learning Spark, High Performance Spark (working on a second edition), Scaling Python with Dask, and Scaling Python with Ray. She’s also a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems. You can find her streaming some of her open source work on YouTube and Twitch in her “spare” time.
Jowanza Joseph: What LLMs Need to Do Great Data Analysis (30 minutes) - 11:35am PT | 2:35pm ET | 6:35pm UTC/GMT
- Jowanza Joseph delves into Parakeet’s analysis feature, a powerful tool that allows the company’s AI assistant, Rosella, to connect to permissioned data sources such as MQTT topics, files, relational databases, and buckets, and significantly accelerates the regulatory reporting process by automating complex data analysis. He discusses the challenges the company encountered while building the system, including the limitations and strengths of using large language models for data analysis and the essential role that robust data engineering plays in ensuring the reliability and scalability of the feature.
- Jowanza Joseph is the founder and CEO of Parakeet, a risk and compliance automation SaaS platform serving the manufacturing sector. He’s had a rich career in technology, working at Mastercard, Adobe, Pluralsight, Zagg, and Ascential, and has spoken at industry conferences including Strange Loop, Abstractions, and the O’Reilly data conference. Jowanza is also the author of Mastering Apache Pulsar, published by O’Reilly.
Matt Housley: Closing Remarks (5 minutes) - 12:05pm PT | 3:05pm ET | 7:05pm UTC/GMT
- Matt Housley closes out today’s event.
Your Hosts and Selected Speakers
Andy Petrella
Andy is an entrepreneur with a background in mathematics and distributed data who focuses on unlocking untapped business potential through new technologies in machine learning, artificial intelligence, and cognitive systems.
In the data community, Andy is known as an early evangelist of Apache Spark (2011-), the Spark Notebook creator (2013-), a public speaker at various events (Spark Summit, Strata, Big Data Spain), and an O’Reilly author (Distributed Data Science, Data Lineage Essentials, Data Governance, and Machine Learning Model Monitoring).
Andy is the CEO of Kensu, whose Data Intelligence Management (DIM) Platform helps data-driven companies leverage AI sustainably by combining AI Observability with a Data Usage Catalog.
Colleen Tartow
Colleen Tartow, PhD, has over 20 years of experience in data, advanced analytics, engineering, and consulting. Adept at assisting organizations in deriving value from a data-driven culture, she has successfully led large data, engineering, and analytics teams through the development of complex global data management solutions and the architecture of front- and back-end SaaS and enterprise data systems. Colleen is also experienced in building and leading diverse teams through business reorganization and transforming existing data ecosystems, maturing them into modern and robust technology stacks. She is determined to make engineering organizations better for both humans and business through mentoring, leadership, and streamlining processes. Her demonstrated excellence in data and engineering leadership makes her a trusted senior advisor among executives, and her work on data, engineering, analytics, and diversity issues has led to her speaking at a variety of events in the technology leadership space and mentoring aspiring leaders in data and technology. Colleen holds an MS and PhD in astrophysics.
Matt Housley
Matt Housley, a data engineering consultant and cloud specialist, is cofounder of Ternary Data, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. Matt, Joe Reis, and their guests pontificate on all things data on The Monday Morning Data Chat.
Eevamaija Virtanen
Eevamaija Virtanen is a senior data engineer and partner at Invinite, founder of the DataTribe community, and cofounder of Helsinki Data Week. With a diverse background spanning data engineering, project management, business development, and photography, she brings an innovative approach to solving complex data and business challenges. Eevamaija is a passionate advocate for continuous learning, cross-pollination, and collaboration, making her a key voice in the Nordic data community.
Adi Polak
Adi Polak is an experienced software engineer, people manager, and author of Scaling Machine Learning with Spark. For most of her professional life, she has dealt with data and machine learning for operations and analytics, developing algorithms to solve real-world problems using ML techniques and leveraging expertise in Apache Spark, Kafka, HDFS, and distributed large-scale systems. Adi has taught Spark to thousands of students and has recently begun a new adventure with data streaming—specifically Flink and ML inference—and is hooked.
Dunith Danushka
Dunith Danushka is senior developer advocate at Redpanda, where he spends most of his time educating developers on how to build event-driven applications. He has a passion for designing, building, and operating large-scale, real-time event-driven architectures and enjoys sharing his knowledge through blogging, videos, and public speaking.
Xinran Waibel
Xinran Waibel is the founder of Data Engineer Things, a global community dedicated to creating and sharing learning resources for data engineering. She builds data applications to power ML algorithms and product innovation on the personalization data engineering team at Netflix. Previously, she was a data engineer at Confluent and Target, where she leveraged big data technologies to enable data-driven decision-making in the marketing and membership space.
Holden Karau
Holden Karau is a transgender Canadian open source developer focusing on Apache Spark and related big data tools. She has worked at Amazon, Apple, Google, Databricks, and IBM. She’s the coauthor of Learning Spark, High Performance Spark (working on a second edition), Scaling Python with Dask, and Scaling Python with Ray. She’s also a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems. You can find her streaming some of her open source work on YouTube and Twitch in her “spare” time.
Jowanza Joseph
Jowanza Joseph is the founder and CEO of Parakeet, a risk and compliance automation SaaS platform serving the manufacturing sector. He’s had a rich career in technology, working at Mastercard, Adobe, Pluralsight, Zagg, and Ascential, and has spoken at industry conferences including Strange Loop, Abstractions, and the O’Reilly data conference. Jowanza is also the author of Mastering Apache Pulsar, published by O’Reilly.