
Multimodal AI Essentials

Published by Pearson

Content level: Intermediate

Learn how multimodal AI merges text, image, and audio for smarter models

  • Deep Dive into Multimodal AI: Explore the cutting-edge of AI by learning how multimodal systems combine text, image, and audio data to create more sophisticated and intuitive AI models.
  • Hands-on Visual Question Answering Project: Transition from theory to practice with a hands-on project focused on building a Visual Question Answering (VQA) model using open-source components.
  • Multidisciplinary Approach: By incorporating examples from various fields such as healthcare, marketing, and entertainment, the course ensures that participants from all backgrounds can see the relevance and potential impact of multimodal AI in their respective areas.

This course offers a comprehensive pathway to understanding and applying multimodal AI, emphasizing a hands-on approach to learning. Multimodal AI refers to AI models and algorithms that integrate and process at least two different types of data simultaneously, such as images and text, or video and audio. Participants will gain not only theoretical insights into how different AI modalities can be integrated for enhanced performance but also practical experience in developing a multimodal AI project from scratch. Whether you are a data scientist, a software developer looking to expand your AI toolkit, or a business professional curious about the potential of AI to transform your industry, this course is designed to equip you with the knowledge and skills to explore the possibilities of multimodal AI.

What you’ll learn and how you can apply it

  • The Fundamentals of Multimodal AI: Gain a solid understanding of what multimodal AI is, including how it integrates various types of data (text, images, audio) to create AI models that more closely mimic human sensory and cognitive processes. Learn about the different components that make up multimodal systems, including input, fusion, and output modules, and how they work together to process complex data (a minimal fusion sketch follows this list).
  • How to Build a Visual Question Answering (VQA) Model: Acquire hands-on experience in constructing a VQA model from scratch using open-source libraries and frameworks. Understand the process of integrating text and image data to develop a system capable of answering questions about visual content, highlighting the practical applications of multimodal AI in real-world scenarios.
  • Applying Multimodal AI in Diverse Fields: Discover how multimodal AI can be applied across various domains such as healthcare, customer service, education, and entertainment. Learn to identify opportunities for leveraging multimodal AI to solve specific challenges and enhance user experiences in your field of interest or work.
  • Navigating Open-Source Tools and Resources: Become proficient in utilizing open-source platforms and tools that support multimodal AI development, including Hugging Face's Transformers library. Understand how to access, modify, and deploy pre-trained models for custom multimodal AI applications, and how these resources can accelerate your AI projects.
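To make the input, fusion, and output modules concrete, here is a minimal late-fusion sketch in PyTorch. Every name and dimension below is an illustrative assumption, not code from the course; in class you'll work with real encoders rather than random stand-in embeddings.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy fusion module: concatenate a text embedding and an image
    embedding (the fusion step), then classify the fused vector."""
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)  # fusion module
        return self.head(fused)                           # output module

# Random tensors stand in for the outputs of text/image encoders (input modules).
text_emb = torch.randn(1, 768)
image_emb = torch.randn(1, 512)
logits = LateFusionClassifier()(text_emb, image_emb)
print(logits.shape)  # torch.Size([1, 10])
```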

This live event is for you because...

  • You're Exploring the Frontiers of AI: Tailored for innovators across fields, whether you're a marketer, educator, entrepreneur, or artist intrigued by the integration of technology and creativity, this course demystifies multimodal AI, making it accessible to anyone eager to understand how combining data types can revolutionize interactions and services.
  • You're a Developer or Data Scientist Seeking Multimodal AI Expertise: Designed for professionals with a foundation in software development or data science, this event pushes the envelope, introducing you to the cutting-edge of AI technology. Elevate your skill set by learning how to build and implement Visual Question Answering systems and other multimodal AI applications.
  • You Aspire to Implement AI Innovations: Perfect for those looking to harness AI's potential to solve complex problems, enhance user experiences, or create new products. Gain hands-on experience in applying AI to diverse scenarios, from enhancing e-commerce with image and text analysis to developing interactive educational tools.
  • You Believe in the Power of Collaboration: This course celebrates the shared progress that comes from collaboration, offering a gateway to join a community of like-minded individuals. It's an exceptional opportunity for those who thrive on collective wisdom and are keen to both contribute to and benefit from the vast pool of knowledge and resources in the AI domain.

Prerequisites

To ensure you can dive straight into the heart of the course content, you should come prepared with:

  • Intermediate to Advanced Python Skills: Comfort with Python is crucial, as we'll use it throughout the course to interact with Hugging Face tools and work through practical multimodal examples.
  • Foundational Machine Learning Knowledge: You should understand core machine learning principles, as we'll build on these concepts when exploring multimodal AI techniques.

Course Set-up

  • Python Environment: Install Python on your machine. We recommend the Anaconda distribution, as it conveniently bundles Python with Jupyter notebooks and other data science tools (a quick sanity-check snippet follows this list).
  • Internet Connection: Ensure you have a reliable internet connection to download course materials and access online resources during the course.
  • Course Materials on GitHub: The course repository contains all the code, datasets, and additional materials you'll need: https://github.com/sinanuozdemir/oreilly-multimodal-ai
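Once Python is installed, you can verify your setup with a short script like the one below. The package list is an assumption based on the tools named in this description (Hugging Face Transformers plus a PyTorch backend); treat the GitHub repository's own requirements as authoritative.

```python
# Quick environment sanity check. The packages below are an assumption based
# on the tools this course description mentions, not an official requirements
# list. Typical install: pip install torch transformers pillow
import sys
print("Python:", sys.version.split()[0])

import torch
import transformers
import PIL

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
```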


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Understanding Multimodal AI: Foundations and Frameworks (60 minutes)

  • Introduction to the concept of multimodal AI and its significance in advancing AI's capabilities.
  • Exploring the types of data (text, images, audio) used in multimodal AI and how they are integrated.
  • Q&A + Break

Segment 2: Navigating Open Source Tools for Multimodal AI Development (60 minutes)

  • Overview of key open-source libraries and frameworks that support multimodal AI, with a focus on tools available on platforms like Hugging Face.
  • Exercise: Exploring and applying pre-trained multimodal models from Hugging Face (a minimal example follows this segment)
  • Q&A + Break
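As a flavor of this segment, the sketch below loads a pre-trained visual question answering pipeline from Hugging Face Transformers. The checkpoint shown (dandelin/vilt-b32-finetuned-vqa) is a common public choice and an assumption here; the class may use different models.

```python
from transformers import pipeline

# Load a pre-trained VQA pipeline. The checkpoint is a public ViLT model
# fine-tuned on VQAv2; it is an illustrative choice, not the course's.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# The pipeline accepts a local file path or a URL for the image.
result = vqa(
    image="http://images.cocodataset.org/val2017/000000039769.jpg",
    question="How many cats are in the picture?",
)
print(result[0])  # e.g. {'score': 0.9..., 'answer': '2'}
```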

Segment 3: Building a Visual Question Answering (VQA) Model (60 minutes)

  • Step-by-step guide on constructing a VQA model using open-source components and datasets (a component-level sketch follows this segment).
  • Exercise: Participants will work on integrating text and image data inputs to start building a basic VQA model.
  • Q&A + Break
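To preview what working at the component level looks like, here is a minimal sketch that loads a processor and model separately instead of using the high-level pipeline. The checkpoint and image URL are illustrative assumptions, not the course's prescribed materials.

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Illustrative checkpoint: a ViLT model fine-tuned for VQA.
checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForQuestionAnswering.from_pretrained(checkpoint)

# Example inputs; substitute your own image and question.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What animals are in the picture?"

# The processor tokenizes the question and preprocesses the image,
# fusing both modalities into one set of model inputs.
inputs = processor(image, question, return_tensors="pt")
outputs = model(**inputs)

# Pick the highest-scoring answer from the model's fixed answer vocabulary.
predicted = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted])
```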

Segment 4: Building Multimodal Applications and Community Engagement (60 minutes)

  • Discussing real-world applications of multimodal AI across different industries and sectors.
  • Leveraging the AI community for collaboration, project enhancement, and staying updated with the latest multimodal AI advancements.
  • Q&A + Break
  • Course Wrap-Up and Next Steps

Your Instructor

  • Sinan Ozdemir

    Sinan Ozdemir is the founder and CTO of LoopGenius, where he uses state-of-the-art AI to help people create and run their businesses. He has lectured in data science at Johns Hopkins University and authored multiple books, videos, and numerous online courses on data science, machine learning, and generative AI. He also founded the recently acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. Sinan most recently published Quick Start Guide to Large Language Models and launched a podcast audio series, AI Unveiled. Ozdemir holds a master’s degree in pure mathematics from Johns Hopkins University.
