AI Superstream: MLOps
Published by O'Reilly Media, Inc.
MLOps is consistently one of the greatest challenges engineers face when creating and maintaining machine learning systems. Join expert practitioners to learn techniques and best practices for operationalizing machine learning models and explore case studies of them in action, showing you what works—and what doesn't.
About the AI Superstream Series: This three-part series of half-day online events is packed with insights from some of the brightest minds in AI. You’ll get a deeper understanding of the latest tools and technologies that can help keep your organization competitive and learn to leverage AI to drive real business results.
What you’ll learn and how you can apply it
- Understand MLOps processes for model deployment, containerization, and automation as well as monitoring, continuous experimentation, and improvement
- Learn how an understanding of SRE and DevOps principles can enhance the practice of MLOps
- Avoid common pitfalls in the process of building end-to-end machine learning pipelines
This live event is for you because...
- You're a data or machine learning practitioner who puts machine learning models into production, or you’re embarking on an MLOps career path.
- You want to improve your process of productionizing machine learning models by applying new techniques and best practices.
Prerequisites
- Come with your questions
- Have a pen and paper handy to capture notes, insights, and inspiration
Recommended follow-up:
- Read Practical MLOps (book)
- Read Reliable Machine Learning (book)
- Watch Radar Talks: Hugo Bowne-Anderson on MLOps Versus DevOps (video)
- Take Practical MLOps (live online course with Noah Gift)
- Take Open Source MLOps in 4 Weeks (live online course with Alex Kim)
- Read Site Reliability Engineering (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Shingai Manjengwa: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 4:00pm UTC/GMT
- Shingai Manjengwa welcomes you to the AI Superstream.
Susan Shu Chang: Keynote—MLOps from Good to Great (15 minutes) - 8:05am PT | 11:05am ET | 4:05pm UTC/GMT
- Machine learning capabilities help businesses create engaging products and relationships with customers. MLOps enables you to further scale these capabilities. Susan Shu Chang addresses the goal of building foundational and powerful MLOps in an organization and explains how to measure the efforts of going from good to great.
- Susan Shu Chang is a principal data scientist at Elastic, which powers search around the world. Previously, she shipped ML at scale in fintech and telecom. Susan is a five-time speaker at PyCons around the world and host of machine learning paper discussions on Aggregate Intellect, which has 17K+ subscribers on YouTube. She’s also the founder and sole developer of Quill Game Studios.
Olga Tsubiks: MLOps Culture for Continuous Experimentation (30 minutes) - 8:20am PT | 11:20am ET | 4:20pm UTC/GMT
- Most MLOps discussion focuses on model deployment, containerization, and automation—but what happens after the models are deployed? Olga Tsubiks untangles the mystery of the MLOps process used for designing machine learning experiments and helps you understand why continuous improvement often requires a cultural change. You’ll decipher MLOps concepts such as model drift and monitoring (which are useful for ML engineers and can help practitioners who are working closely with data scientists or those who aspire to build complex experimentation frameworks) and explore common pitfalls on the way to MLOps maturity. You’ll come away knowing how to navigate the challenges of managing ML pipelines and build a culture of continuous experimentation and improvement.
- Olga Tsubiks is director of advanced analytics and data science at the Royal Bank of Canada, where she’s responsible for the development and evolution of next-generation capacity modeling. A passionate AI/ML leader, she’s spent the last 15 years in various senior roles in data science, big data, data engineering, analytics, and data warehousing. She’s also worked on various data science and analytics challenges with global organizations such as the UN Environment Programme World Conservation Monitoring Centre, the World Resources Institute, and prominent Canadian nonprofits such as War Child Canada and Rainbow Railroad.
- Break (5 minutes)
Noah Gift: What Can MLOps Learn from the SRE Mindset? (30 minutes) - 8:55am PT | 11:55am ET | 4:55pm UTC/GMT
- One of the significant drivers of rapid adoption of MLOps is the need for the enterprise to show return on investment. Companies making substantial investments in data science often don’t get that return because their models don't make it to production. In many ways, this is similar to the era of "spaghetti coding" at the dawn of the internet, in which many companies faced significant outages, slow deliveries, and technical debt. The solution to the spaghetti coding problem has been the rigorous implementation of both site reliability engineering methodologies and DevOps. Automation is perhaps the most important of the SRE and DevOps principles for fixing software problems, but it’s also at the center of remedying issues in MLOps. Join Noah Gift for a retrospective of the origins of poorly maintained software systems and a discussion of how the fix—in the forms of SRE and DevOps—can teach new practitioners of MLOps many best practices that plug right into this emerging discipline.
- Noah Gift is a lecturer and consultant in both the UC Davis Graduate School of Management’s MSBA program and Northwestern’s graduate data science program, MSDS, where he teaches and designs graduate machine learning, AI, and data science courses. He also consults for startups and other companies and is the author of close to 100 technical publications, including two books on subjects ranging from cloud machine learning to DevOps. Over his career, he’s served in roles ranging from CTO, general manager, and consulting CTO to cloud architect at companies including ABC, Caltech, Sony Pictures Imageworks, Disney Feature Animation, Weta Digital, AT&T, Turner Studios, and Linden Lab.
Jason Bell: Deployment and Metrics of Machine Learning Models with Kubernetes and Prometheus (30 minutes) - 9:25am PT | 12:25pm ET | 5:25pm UTC/GMT
- Find out why Kubernetes is perfect for MLOps as Jason Bell deconstructs the process of deploying a machine learning model into a Kubernetes cluster and accessing the model within the cluster to make a live prediction. Along the way, you’ll pick up strategies for dealing with model updates, updating models across the cluster, and monitoring their input and responses using Prometheus and Kubernetes.
- Jason Bell has been involved in software development for over 30 years. Since 2002, he’s specialized in the customer data journey with open source software. He’s the author of Machine Learning: Hands on for Developers and Technical Professionals and has contributed his expertise to several O’Reilly conferences and various open source projects. He created the Synthetica Data Platform to help companies and software developers create data for artificial intelligence model training and data for big data, streaming, and other systems. As an OpenUK Ambassador, Jason’s interested in the potential and ethical considerations of using open data and open source software for artificial intelligence. He also likes tea.
- Break (5 minutes)
Isabel Zimmerman: Composable Tools for Robust MLOps Deployment (30 minutes) - 10:00am PT | 1:00pm ET | 6:00pm UTC/GMT
- When building systems made to last, it’s important to choose the right tools for the job. But the MLOps landscape is broad, and there’s no one-size-fits-all approach to building an MLOps system, so you need to understand your options. Isabel Zimmerman takes you through some of the tools available in the open source space and explains how to use them in conjunction with each other to build a robust MLOps system.
- Isabel Zimmerman is an open source software engineer at Posit, where she works on a team building machine learning operations frameworks for Python and R models. She’s a passionate advocate for open source MLOps and an international keynote speaker. When not online, she can be found baking vegan desserts or teaching her dog new tricks.
Todd Underwood: ML Model Quality as a Reliability Problem (30 minutes) - 10:30am PT | 1:30pm ET | 6:30pm UTC/GMT
- Model quality and performance is typically thought of as the model developer’s problem rather than a concern of ML production engineers. But changes in model quality represent the only truly end-to-end test of ML infrastructure, reliably identifying subtle problems in feature storage, metadata, model configuration, training, and serving. ML production engineers generally avoid directly measuring or responding to changes in model quality as an operational or reliability concern—but that needs to change. Todd Underwood elaborates on this perspective based on his experience at Google, explores some of the technical and cultural transformations the company had to make in order to close the model quality loop, and shares some of the results his teams have seen through this work.
- Todd Underwood is a senior director at Google leading reliability for machine learning. Google’s ML SRE teams build and scale internal and external AI/ML services and are critical to almost every product area at the company. Todd’s also the site lead for Google’s Pittsburgh office.
Shingai Manjengwa: Closing Remarks (5 minutes) - 11:00am PT | 2:00pm ET | 7:00pm UTC/GMT
- Shingai Manjengwa closes out today’s event.
Your Host
Shingai Manjengwa
Shingai Manjengwa is the head of AI education at ChainML, a tech startup that has developed an open source platform for the rapid and responsible development of generative AI tools. ChainML works with clients on AI education, adoption, and implementation from an AI product idea to an affordable and scalable deployment. A data scientist by profession, she led technical education at the Vector Institute for Artificial Intelligence in Toronto, where she translated advanced AI research into educational programming to drive AI adoption and innovation in industry and government. She also founded Fireside Analytics Inc., a data science education company that develops customized programs to teach digital and AI literacy, data science, bias and fairness in machine learning, and computer programming. Shingai’s book, The Computer and the Cancelled Music Lessons, teaches data science to kids ages 5 to 12. She also sits on the Service Advisory Committee for Employment and Social Development Canada and she’s a board member at the Institute on Governance. You can find Shingai on LinkedIn and X (Twitter) as @Tjido.