Multimodal Machine Learning

Published by O'Reilly Media, Inc.

Beginner to intermediate content level

Handling images, language, and speech

This live event utilizes Jupyter Notebook technology

In this course, you’ll:

  • Learn how to use a multimodal framework (MMF) for image and text applications
  • Understand real-world applications involving multiple data sources
  • Identify where multimodal learning could support your application

Course description

Humans use multiple senses—sight, hearing, touch, taste, and smell—to form a holistic view of the world. Similarly, in machine learning, combining several sources of input, or modalities (such as text, audio, and images), can benefit a prediction problem. For example, images can be generated from text, video data (composed of audio, video, and text) can be used for sentiment classification and action recognition, and audio or textual descriptions can be generated from images. All such applications need methods and models that can handle multiple types of information at the same time.

Join expert Purvanshi Mehta to gain an understanding of what multimodal learning is and the latest techniques for handling multiple sources of information. You’ll learn how to use the multimodal framework (MMF) for classification with images and text, build your own multimodal fusion model from scratch, use CLIP for image search, and use DALL-E to generate images from text. Then, moving from preprocessing to state-of-the-art models (Perceiver and Data2Vec), you’ll learn how to form video encodings for human action classification tasks.

What you’ll learn and how you can apply it

  • Explore multimodal learning and how to apply it
  • Understand and build real-world applications involving multiple data sources
  • Use HuggingFace for multimodal learning
  • Use CLIP to do an image search (a short sketch follows this list)
  • Use DALL-E to generate images
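
As a taste of the CLIP-based image search mentioned above, here is a minimal sketch using the CLIPModel and CLIPProcessor classes from HuggingFace transformers. The checkpoint name, the image file names, and the query string are illustrative assumptions, not course materials.

    # Minimal CLIP image-search sketch (assumed checkpoint and file names).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # A small "gallery" of images to search over (placeholder paths).
    image_paths = ["dog.jpg", "beach.jpg", "skyline.jpg"]
    images = [Image.open(path) for path in image_paths]

    query = "a photo of a dog playing fetch"

    # Encode the text query and all gallery images in one forward pass.
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_text has shape (num_texts, num_images); higher means more similar.
    scores = outputs.logits_per_text[0]
    best = scores.argmax().item()
    print(f"Best match for '{query}': {image_paths[best]}")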

This live event is for you because...

  • You’re a data scientist, ML engineer, or applied ML scientist interested in learning how to process information from multiple sources.
  • You already know the basics of multimodal learning but want to dive deeper into it.
  • You want to explore a new field in ML or build cutting-edge models that can handle multiple sources of information.
  • You want to catch up with state-of-the-art models in speech, image, and text understanding.

Prerequisites

  • A basic understanding of deep learning
  • Familiarity with the PyTorch library

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction to multimodal learning basics and implementation using MMF (60 minutes)

  • Presentation: What is multimodal learning, and why do we need it?; definitions of the core challenges (representation learning, translation, alignment, fusion, and co-learning); introduction to the PyTorch-based Multimodal Framework (MMF); using text and images to classify a meme as hateful
  • Jupyter notebooks: Use a Pretrained Model with MMF; Build a Model from Scratch Using MMF and PyTorch (a plain-PyTorch fusion sketch follows this section)
  • Q&A
  • Break
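
The from-scratch notebook above uses MMF; as a rough illustration of the underlying idea, below is a minimal late-fusion classifier in plain PyTorch that concatenates an image embedding and a text embedding before a shared classification head. The feature sizes and the two-class output are illustrative assumptions, not the course's MMF-based implementation.

    # Minimal late-fusion sketch in plain PyTorch (illustrative sizes, not MMF code).
    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=2):
            super().__init__()
            # Project each modality into a space of the same size.
            self.image_proj = nn.Linear(image_dim, hidden_dim)
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            # Fuse by concatenation, then classify.
            self.classifier = nn.Sequential(
                nn.ReLU(),
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, image_features, text_features):
            fused = torch.cat(
                [self.image_proj(image_features), self.text_proj(text_features)], dim=-1
            )
            return self.classifier(fused)

    # Dummy pre-extracted features, e.g. a pooled CNN vector and a [CLS] embedding.
    image_features = torch.randn(4, 2048)
    text_features = torch.randn(4, 768)
    logits = LateFusionClassifier()(image_features, text_features)
    print(logits.shape)  # torch.Size([4, 2])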

Different applications and models of multimodal learning (60 minutes)

  • Presentation: Visual generation from text (DALL-E); vision and text classification tasks using CLIP; prompt engineering; searching for an image using text
  • Jupyter notebooks: Use DALL-E for Image Generation; Zero-Shot Image Classification with CLIP (an illustrative DALL-E sketch follows this section)
  • Break
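
For the text-to-image segment, below is a minimal sketch of image generation with OpenAI's hosted DALL-E, assuming the v1-style openai Python SDK and an OPENAI_API_KEY environment variable; the model name and prompt are placeholders, and the course notebook may use a different client or setup.

    # Text-to-image sketch using OpenAI's hosted DALL-E (assumed setup:
    # `pip install openai` and an OPENAI_API_KEY environment variable).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.images.generate(
        model="dall-e-3",  # assumed model name; the course may use a different version
        prompt="an armchair in the shape of an avocado, product photo",
        n=1,
        size="1024x1024",
    )

    # The API returns a URL (or base64 data) for each generated image.
    print(response.data[0].url)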

Other multimodal models (55 minutes)

  • Presentation: Perceiver model for encoding videos; Do we need different models for processing different data types?; Data2Vec introduction and applications; HuggingFace for multimodal learning
  • Jupyter notebook: Classify Human Action (an illustrative video-classification sketch follows below)
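
As a stand-in illustration of video classification with HuggingFace, below is a sketch that classifies a clip with a VideoMAE checkpoint fine-tuned on Kinetics-400; the checkpoint name and the random dummy clip are assumptions, and the course itself covers the Perceiver and Data2Vec approaches.

    # Stand-in sketch: human-action classification with a HuggingFace video model.
    # VideoMAE is used here for illustration; the course covers Perceiver and Data2Vec.
    import numpy as np
    import torch
    from transformers import AutoImageProcessor, VideoMAEForVideoClassification

    checkpoint = "MCG-NJU/videomae-base-finetuned-kinetics"  # assumed checkpoint
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = VideoMAEForVideoClassification.from_pretrained(checkpoint)

    # A dummy clip of 16 frames; in practice, sample frames from a real video.
    video = list(np.random.randn(16, 3, 224, 224))

    inputs = processor(video, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predicted = logits.argmax(-1).item()
    print(model.config.id2label[predicted])  # a Kinetics-400 action label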

Wrap-up and Q&A (5 minutes)

Your Instructor

  • Purvanshi Mehta

    Purvanshi Mehta is an applied scientist on the graph intelligence science team at Microsoft, where she works with multimodal data—images, text, and discrete data—daily, providing efficient solutions to partner Office 365 teams. She’s also a volunteer manager for Microsoft’s AI for Good initiative, which involves partnering with Innovations for Poverty Action (IPA) to improve the Poverty Probability Index (PPI) by using multimodal data. Previously, she worked at Amazon Lab126 on the Alexa AI natural language understanding team, and at Lulea Technical University in Sweden and TU Kaiserslautern in Germany on various aspects of deep learning. An article on her thesis, “Interpretability in Multimodal Deep Learning,” has been read by more than 70K people on Medium. She’s also presented papers on probabilistic deep learning, graph learning, arithmetic word-problem solving, and language processing at NeurIPS and WSDM.