
AI Superstream: Multimodal Generative AI

Published by O'Reilly Media, Inc.

Content level: Beginner to advanced

Evolving to more powerful and expressive systems

While large language models are groundbreaking tools for automating everyday text-based tasks such as text summarization, translation, and generation, we've also seen the emergence of more complex generative AI models that can process and output different types of data, such as images, audio, and even video. Multimodal AI models, such as GPT-4, are capable of working across different data formats, for example, to generate speech from text, text from images, or text from audio. By combining different modalities, multimodal AI can interact with humans in more natural, intuitive ways, mimicking how humans perceive and understand the world around them. The possibilities from processing inputs more holistically and providing more intuitive outputs are already nudging us closer to true artificial general intelligence.

Join trailblazers and experienced industry experts working with the latest possibilities in AI to create interactions that feel natural, intuitive, and comprehensive.

About the AI Superstream Series: This three-part series of half-day online events is packed with insights from some of the brightest minds in AI. You’ll get a deeper understanding of the latest tools and technologies that can help keep your organization competitive and learn to leverage AI to drive real business results.

What you’ll learn and how you can apply it

  • Design more natural, human-like interactions between AI systems and users by leveraging multimodal capabilities
  • Explore fundamental mathematical concepts like multimodal alignment and fusion, heterogeneous representation learning, and multistream temporal modeling (a brief fusion sketch follows this list)
  • Review practical applications such as advanced voice assistants, smart home systems, and virtual shopping experiences
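
To make the fusion concept above concrete, here is a minimal, illustrative late-fusion sketch: each modality is encoded separately, and the embeddings are concatenated and projected into a shared representation. It is not taken from any session; the embedding dimensions and projection weights are random stand-ins.

```python
import numpy as np

# Illustrative late-fusion sketch: encode each modality separately, then combine.
# All values below are random stand-ins for real encoder outputs and learned weights.
rng = np.random.default_rng(0)

text_embedding = rng.normal(size=768)    # e.g., output of a text encoder (hypothetical dim)
image_embedding = rng.normal(size=512)   # e.g., output of an image encoder (hypothetical dim)

# Fusion: concatenate the modality embeddings, then apply a learned linear projection.
fused_input = np.concatenate([text_embedding, image_embedding])   # shape (1280,)
projection = rng.normal(size=(256, fused_input.shape[0])) * 0.01  # stand-in for learned weights
joint_representation = np.tanh(projection @ fused_input)          # shared multimodal vector

print(joint_representation.shape)  # (256,)
```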

This live event is for you because...

  • You're a current or future AI product owner or AI/machine learning practitioner.
  • You want to learn about the state of the art in artificial intelligence and how large language models can be leveraged to build new applications and solve your organizational challenges.

Prerequisites

  • Come with your existing knowledge of machine learning and AI and your questions
  • Have a pen and paper handy to capture notes, insights, and inspiration

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Susan Shu Chang: Introduction (5 minutes) - 8:00am PT | 11:00am ET | 3:00pm UTC/GMT

  • Susan Shu Chang welcomes you to the AI Superstream.

Antje Barth–Keynote: Recent Breakthroughs in Multimodal Generative AI (15 minutes) - 8:05am PT | 11:05am ET | 3:05pm UTC/GMT

  • As generative AI continues to evolve at a rapid pace, the paradigm is shifting toward multimodal models that can seamlessly integrate and generate across various forms of data. Antje Barth explores the cutting-edge developments in multimodal generative AI, examining how these models are bridging the gap between text, images, audio, and even 3D content. You’ll dive into recent breakthrough models, their architectural innovations, and real-world applications across industries.
  • Antje Barth is a principal developer advocate for generative AI at Amazon Web Services. She’s also coauthor of the O’Reilly books Generative AI on AWS and Data Science on AWS. A frequent speaker at AI and machine learning conferences and meetups around the world, she cofounded the global Generative AI on AWS Meetup and the Düsseldorf chapter of Women in Big Data. Previously, Antje worked in solutions engineering roles at MapR and Cisco, helping developers leverage big data, containers, and Kubernetes platforms in the context of AI and machine learning.

Rikin Gandhi: How We Built Farmer.Chat, a Multimodal GenAI Assistant (30 minutes) - 8:20am PT | 11:20am ET | 3:20pm UTC/GMT

  • Farmer.Chat, an AI assistant for agriculture, puts world-class advice in the hands of extension workers and farmers, helping them react to changing climate patterns, soil conditions, and market cycles with the power of generative AI. Trained on validated content, including a library of 8,000+ locally produced videos by and for farmers, and integrated with agricultural research models and dynamic weather and market data, Farmer.Chat empowers users to get advice in real time on popular messaging platforms like WhatsApp and Telegram. Rikin Gandhi shares the story of how this groundbreaking AI assistant was created and how it helps shape a future where all farmers can access critical information and make the best decisions to optimize their farms’ productivity.
  • Rikin Gandhi, CEO and cofounder of Digital Green, has a background in computer science and aerospace engineering from Carnegie Mellon University and MIT. He has worked at Oracle and Microsoft Research and has coauthored significant works on agrifood innovation systems. Since setting up as a nonprofit in 2008, Digital Green has been improving the cost-effectiveness of agricultural extension in South Asia and Africa with videos by and for farmers, serving 100,000 extension agents and six million farmers, and it is now leveraging generative AI to flip the paradigm of extension for women farmers on the frontlines of climate change.
  • Break (5 minutes)

Suhas Pai: Evaluation of Multimodal Systems (30 minutes) - 8:55am PT | 11:55am ET | 3:55pm UTC/GMT

  • Until recently, all multimodal models processed each modality (text, images, or speech) separately, utilizing modality-specific architectural components. This is no longer the case with newer models like Chameleon, which represent all modalities with discrete tokens, using the same architecture end-to-end (a minimal sketch of this shared-token idea appears after this session’s description). While advances like these make multimodal models easier to adopt, evaluation of these systems comes with unique challenges. Suhas Pai explores multimodal model evaluation, with a focus on evaluating cross-modal interactions and information flow, combining information from multiple modalities, and multimodal downstream tasks. You’ll hear about existing evaluation benchmarks and their limitations and come away with tips for developing your own internal multimodal benchmarks.
  • Suhas Pai is an NLP researcher and cofounder/CTO at Hudson Labs, a Toronto-based startup. He’s writing the book Designing Large Language Model Applications, now in early release for O’Reilly Media. Suhas has led and contributed to various open source models, including as co-lead of the privacy working group at BigScience, part of the BLOOM open source LLM project. He’s also active in the ML community as chair of the Toronto Machine Learning Summit conference since 2021 and NLP lead at Aggregate Intellect.
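
For readers unfamiliar with the shared-token approach mentioned above, here is a minimal, hypothetical illustration of the idea that text and images can be mapped into one discrete token stream consumed by a single model end to end. The vocabularies, tokenizers, and offsets below are invented for demonstration and do not reflect Chameleon’s actual implementation.

```python
# Minimal illustration (hypothetical vocabularies and tokenizers) of the idea that
# text and images can share one discrete token stream consumed by a single model.

TEXT_VOCAB_SIZE = 32_000          # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192       # assumed size of a learned image codebook (VQ-style)

def encode_text(text: str) -> list[int]:
    # Stand-in for a real text tokenizer: map characters to IDs in the text range.
    return [ord(ch) % TEXT_VOCAB_SIZE for ch in text]

def encode_image(pixels: list[float]) -> list[int]:
    # Stand-in for a learned image tokenizer: quantize values into codebook indices,
    # then offset them so they don't collide with text token IDs.
    return [TEXT_VOCAB_SIZE + int(p * (IMAGE_CODEBOOK_SIZE - 1)) for p in pixels]

# One interleaved sequence; a unified model would attend over all of it end to end.
sequence = encode_text("Describe the image: ") + encode_image([0.1, 0.5, 0.9])
print(sequence[:8], "...", len(sequence), "tokens total")
```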

Omar Aldughayem: Enhancing Telecom Customer Service with Multimodal AI-Powered Chatbots (Sponsored by Mobily) (30 minutes) - 9:25am PT | 12:25pm ET | 4:25pm UTC/GMT

  • Omar Aldughayem delves into the application of multimodal AI-powered chatbots in the telecom industry, exploring how these systems integrate NLP, speech recognition, and image analysis to revolutionize customer interactions. He examines the evolution of multimodal AI technologies and their ability to process and interpret various data types simultaneously, providing seamless and context-aware responses. He also shares insights into the technical implementation of chatbots, focusing on the integration of NLP, sentiment analysis, and real-time problem-solving across text, voice, and visual inputs. You’ll learn practical use cases that demonstrate the impact of multimodal AI on reducing response times, enhancing 24/7 support capabilities, and improving overall customer satisfaction through sophisticated, data-driven interactions.
  • Omar Aldughayem is an advanced analytics expert at Mobily, working on projects that integrate machine learning to enhance customer experience, optimize debt collection processes, and deliver deep insights into customer behavior. He has extensive experience in deploying AI-driven solutions in the telecommunications and oil and gas industries. Prior to Mobily, Omar worked with RPDC on the implementation of reinforcement learning for industrial robotics systems. He has also played a pivotal role in the development of open source robotics systems as the cofounder and vice chair of RoboTweak, where he continues to drive innovation in AI applications for robotics. Omar holds a PhD in electrical and electronic engineering from the University of Manchester.
  • This session will be followed by a 30-minute Q&A in a breakout room. Stop by if you have more questions for Omar.
  • Break (5 minutes)

Nahid Alam: Unveiling the Edge of Generative AI—Resource, Cost, and Performance Trade-Offs for Multimodal Foundational Models (30 minutes) - 10:00am PT | 1:00pm ET | 5:00pm UTC/GMT

  • Multimodal foundational models, capable of processing diverse data modalities (text, image, speech), have revolutionized AI's ability to understand the world. However, their computational demands often necessitate deployment on powerful cloud infrastructure. Nahid Alam explores the emerging paradigm of deploying these models directly on resource-constrained edge devices and delves into the intricate relationship between resource limitations, cost considerations, and attainable performance when deploying cloud, local, hybrid, and edge-based multimodal generative AI solutions. (A rough memory-footprint estimate illustrating these constraints follows this session’s description.)
  • Nahid Alam has spent the last decade working with Fortune 500 companies and bringing innovative ideas to businesses. She’s a staff AI engineer at Cisco Meraki, focusing on computer vision for security cameras. Her recent work looks into developing techniques for identifying key moments in long-form videos. She has authored works on deep learning and is named on a few technical disclosures and a US patent. Nahid holds bachelor’s and master’s degrees in computer science.
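
As a back-of-the-envelope illustration of the resource constraints this session addresses, model weight memory scales roughly with parameter count times bits per parameter. The 7B parameter count and the overhead factor below are assumptions chosen for demonstration, not figures from the talk.

```python
# Rough, illustrative estimate of model weight memory at different quantization levels.
# The 7B parameter count and the ~20% runtime overhead factor are assumptions for
# demonstration, not figures from the session.

PARAMS = 7e9  # e.g., a 7-billion-parameter model

def weight_memory_gib(params: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Approximate memory for weights plus a fixed overhead factor, in GiB."""
    return params * bits_per_param / 8 * overhead / (1024 ** 3)

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gib(PARAMS, bits):.1f} GiB")
```

At 4-bit precision the same model needs roughly a quarter of the fp16 footprint, which is the kind of trade-off that makes edge and laptop deployment feasible at some cost in quality.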

Anthony Susevski and Andrei Betlen: Quickly POCing Multimodal LLMs, Even on a ThinkPad (30 minutes) - 10:30am PT | 1:30pm ET | 5:30pm UTC/GMT

  • Anthony Susevski and Andrei Betlen explore the latest advancements in vision-language models and their applications in the enterprise, including the capabilities of current models, hardware requirements, and common cloud infrastructure challenges. They illustrate the practical value of these models with a real-world use case, discuss the benefits of conducting local proofs of concept, and demonstrate a vision-language model running on a ThinkPad to highlight the feasibility of deploying advanced AI on everyday hardware (a minimal local-inference sketch follows this session’s description).
  • Anthony Susevski is a data scientist at RBC Capital Markets and a senior data quality specialist for Cohere. He spends most of his (limited) free time getting as involved as possible in the expanding machine learning space. He also contributes to Hugging Face community events, participates in the Cohere4AI community, and makes YouTube videos on various topics.
  • Andrei Betlen is an experienced software engineer with a background in computer vision, delivering R&D solutions in healthcare and defense. He’s also a contributor to the popular open source project llama.cpp, which allows anyone to run large language models on consumer hardware.
  • Break (5 minutes)
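
As context for the local proof-of-concept theme above, here is a sketch of loading a quantized vision-language model on a laptop with llama-cpp-python (Python bindings for llama.cpp). It follows the library’s documented LLaVA-style usage, but exact class names and parameters can vary between versions, and the model, projector, and image paths below are placeholders.

```python
# Sketch of a local vision-language POC using llama-cpp-python (bindings for llama.cpp).
# File paths and the image URL are placeholders; API details may differ across versions.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# A LLaVA-style model needs both the language model weights and a CLIP projector file.
chat_handler = Llava15ChatHandler(clip_model_path="./models/mmproj.gguf")
llm = Llama(
    model_path="./models/llava-v1.5-7b.Q4_K_M.gguf",  # quantized weights fit in laptop RAM
    chat_handler=chat_handler,
    n_ctx=2048,  # larger context to leave room for image tokens
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant that describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        },
    ]
)
print(response["choices"][0]["message"]["content"])
```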

Shekhar Iyer: Risky Business—How to Protect GenAI Applications from Security and Safety Risks (30 minutes) - 11:05am PT | 2:05pm ET | 6:05pm UTC/GMT

  • The last two years have seen remarkable advancements in GenAI, the most recent of which has been the release of multimodal models. While these frontier models are incredibly capable, they also introduce new safety and security risks. As developers harness multimodal models to power innovative applications, they may inadvertently expose their company to these risks despite the internal guardrails baked into these models. Shekhar Iyer reviews the top AI threats using real-world examples, explores what’s required to meet emerging standards and regulations, and provides a model-agnostic framework that you can use throughout the AI development lifecycle to protect your multimodal applications.
  • Chandrasekhar Iyer is director of AI at Robust Intelligence, where his focus is on building a comprehensive threat discovery and intelligence framework to understand new and emerging AI security risks, technology that helps make AI adoption safe for companies, and deep research around offensive and defensive AI security. Previously, he led ML efforts in the areas of information retrieval, conversational AI, graph learning, and risk detection at companies including Amazon, Meta, and Stripe.

Chris Fregly: Beyond LLMs—Mastering Multimodal RAG for Engaging Generative AI Applications (30 minutes) - 11:35am PT | 2:35pm ET | 6:35pm UTC/GMT

  • Chris Fregly describes and demonstrates cutting-edge multimodal retrieval-augmented generation (RAG), focusing on its applications in processing diverse data types such as images, audio, and video. Developers and AI enthusiasts looking to expand their knowledge of advanced RAG implementations will gain insights into multimodal embeddings, vector stores, and practical use cases as well as techniques for integrating images, audio, and video into RAG systems (a minimal retrieval sketch follows this session’s description).
  • Chris Fregly is a principal solutions architect for generative AI at Amazon Web Services. He holds the full complement of AWS certifications and is coauthor of Generative AI on AWS and Data Science on AWS, both for O’Reilly. A cofounder of the global Generative AI on AWS Meetup, he often speaks at AI and machine learning meetups and conferences across the world. Previously, Chris was an engineer at Databricks and Netflix, where he worked on scalable big data and machine learning products and solutions.
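
To make the retrieval step concrete, here is a minimal, library-agnostic sketch of multimodal retrieval for RAG, assuming a shared embedding space for text and images (as a CLIP-style encoder would provide). The embedding functions and the in-memory "vector store" below are stand-ins, not the speaker’s implementation.

```python
import numpy as np

# Minimal, library-agnostic sketch of multimodal retrieval for RAG.
# Assumption: a shared embedding space exists for text and images (e.g., a CLIP-style
# model); embed_text / embed_image below are random stand-ins for such encoders.
rng = np.random.default_rng(42)

def embed_text(text: str) -> np.ndarray:
    return rng.normal(size=512)   # stand-in for a real text encoder

def embed_image(path: str) -> np.ndarray:
    return rng.normal(size=512)   # stand-in for a real image encoder

# "Vector store": documents of mixed modality, each with an embedding.
store = [
    {"id": "spec.pdf#p3", "embedding": embed_text("product spec excerpt")},
    {"id": "diagram.png", "embedding": embed_image("diagram.png")},
    {"id": "demo.mp4#00:42", "embedding": embed_text("transcript of demo clip")},
]

def retrieve(query: str, k: int = 2):
    # Rank stored items by cosine similarity to the query embedding.
    q = embed_text(query)
    scored = [(float(np.dot(q, d["embedding"]) /
                     (np.linalg.norm(q) * np.linalg.norm(d["embedding"]))), d["id"])
              for d in store]
    return sorted(scored, reverse=True)[:k]

# The retrieved items would then be passed to a multimodal generator as context.
print(retrieve("How does the device connect to WiFi?"))
```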

Jingying Gao: Teaching AI to Solve Complex Logical Reasoning Using Multimodal Models (30 minutes) - 12:05pm PT | 3:05pm ET | 7:05pm UTC/GMT

  • Human beings rely on multimodal intelligence to understand and reason about the world around them. This approach is also crucial for AI, as it enables machines to learn and reason from multiple sources of information. Visual question answering (VQA) requires multimodal AI because it involves understanding and answering questions about visual content, bridging the gap between visual perception, language understanding, common sense, and logical reasoning. State-of-the-art large multimodal models aim to tackle such reasoning problems, but existing methods, which often rely on deep neural networks to learn implicit representations from data, lack the capacity to reason about complex logical problems and to provide interpretable explanations.
  • Jingying Gao is a senior manager for AI and data in Commonwealth Bank of Australia’s AI Labs, where she has led generative AI research and applied AI projects. She’s a passionate AI scientist with over eight years of commercial and academic AI experience, preceded by years of R&D leadership, technical project management, and consulting experience at Edenred, Ernst & Young, HP, and Sony. Jingying is also passionate about robotics; she cofounded an AI-centered robotics startup where she led R&D teams to design, develop, and deliver two family service robots called Yiling and Yiyi. She earned a PhD at the University of New South Wales, focusing on multimodal AI and explainable AI.

Susan Shu Chang: Closing Remarks (5 minutes) - 12:35pm PT | 3:35pm ET | 7:35pm UTC/GMT

  • Susan Shu Chang closes out today’s event.

Your Hosts and Selected Speakers

  • Susan Shu Chang

    Susan Shu Chang is a principal data scientist at Elastic (of Elasticsearch), with previous ML experience in fintech, telecommunications, and social platforms. Susan is an international speaker, with talks at six PyCons worldwide and keynotes at Data Day Texas, PyCon DE & PyData Berlin, and O’Reilly’s AI Superstream. She writes about machine learning career growth in her newsletter, susanshu.substack.com. In her free time she leads a team of game developers under Quill Game Studios, with multiple games released on consoles and Steam.

  • Chris Fregly

    Chris Fregly is a San Francisco, California-based developer advocate for AI and machine learning at Amazon Web Services (AWS). He’s worked with Kubeflow and MLflow since 2017 and founded the global Advanced Kubeflow Meetup. Chris regularly speaks at ML/AI conferences across the world, including the O’Reilly AI and Strata Data Conferences. Previously, Chris was founder at PipelineAI, helping startups and enterprises continuously deploy AI and machine learning pipelines using Kubeflow and MLflow, and was an ML-focused engineer at both Netflix and Databricks.

  • Antje Barth

    Antje Barth is a principal developer advocate for AI and machine learning at AWS. She’s the coauthor of Data Science on AWS and frequently speaks at AI and machine learning conferences, online events, and meetups around the world. Antje is also passionate about helping developers leverage big data, container, and Kubernetes platforms in the context of AI and machine learning. She’s cofounder of the Düsseldorf chapter of Women in Big Data.

  • Rikin Gandhi

    Rikin Gandhi, CEO and cofounder of Digital Green, has a background in computer science and aerospace engineering from Carnegie Mellon University and MIT. He has worked at Oracle and Microsoft Research and has coauthored significant works on agrifood innovation systems. Since setting up as a nonprofit in 2008, Digital Green has been improving the cost-effectiveness of agricultural extension in South Asia and Africa with videos by and for farmers, serving 100,000 extension agents and six million farmers, and it is now leveraging generative AI to flip the paradigm of extension for women farmers on the frontlines of climate change.

  • Suhas Pai

    Suhas Pai is an NLP researcher and cofounder/CTO at Hudson Labs, a Toronto-based startup. He’s writing the book Designing Large Language Model Applications, now in early release for O’Reilly Media. Suhas has led and contributed to various open source models, including as co-lead of the privacy working group at BigScience, part of the BLOOM open source LLM project. He’s also active in the ML community as chair of the Toronto Machine Learning Summit conference since 2021 and NLP lead at Aggregate Intellect.

  • Nahid Alam

    Nahid Alam has spent the last decade working with Fortune 500 companies and bringing innovative ideas to businesses. She’s a staff AI engineer at Cisco Meraki, focusing on computer vision for security cameras. Her recent work looks into developing techniques for identifying key moments in long-form videos. She has authored works on deep learning and is named on a few technical disclosures and a US patent. Nahid holds bachelor’s and master’s degrees in computer science.

  • Anthony Susevski

    Anthony Susevski is a data scientist at RBC Capital Markets and a senior data quality specialist for Cohere. He spends most of his (limited) free time getting as involved as possible in the expanding machine learning space. He also contributes to Hugging Face community events, participates in the Cohere4AI community, and makes YouTube videos on various topics.

  • Andrei Betlen

    Andrei Betlen is an experienced software engineer with a background in computer vision, delivering R&D solutions in healthcare and defense. He’s also a contributor to the popular open source project llama.cpp, which allows anyone to run large language models on consumer hardware.

  • Shekhar Iyer

    Chandrasekhar Iyer is director of AI at Robust Intelligence, where his focus is on building a comprehensive threat discovery and intelligence framework to understand new and emerging AI security risks, technology that helps make AI adoption safe for companies, and deep research around offensive and defensive AI security. Previously, he led ML efforts in the areas of information retrieval, conversational AI, graph learning, and risk detection at companies including Amazon, Meta, and Stripe.

  • Jingying Gao

    Jingying Gao is a senior manager for AI and data in Commonwealth Bank of Australia’s AI Labs, where she has led generative AI research and applied AI projects. She’s a passionate AI scientist with over eight years of commercial and academic AI experience, preceded by years of R&D leadership, technical project management, and consulting experience at Edenred, Ernst & Young, HP, and Sony. Jingying is also passionate about robotics; she cofounded an AI-centered robotics startup where she led R&D teams to design, develop, and deliver two family service robots called Yiling and Yiyi. She earned a PhD at the University of New South Wales, focusing on multimodal AI and explainable AI.

Sponsored by

  • Mobily