Chapter 9. Generating Audio

In Chapter 1, we caught a glimpse of the potential of audio generation with a transformers pipeline based on the MusicGen model by Meta. This chapter dives into generative audio using both diffusion- and transformer-based techniques, which introduce a new set of exciting challenges and applications. Imagine if you could remove all background noise in real time during a call, get high-quality transcriptions and summaries of conferences, or let a singer regenerate their songs in other languages. You could even take a theme from Mozart's and Billie Eilish's compositions and give it a mariachi-infused twist. Well, that's the field's trajectory; exciting times are ahead.
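As a quick refresher, a pipeline call like the one from Chapter 1 can be sketched in a few lines. The checkpoint name, prompt, and output filename below are illustrative assumptions rather than requirements:

```python
from transformers import pipeline
import scipy.io.wavfile

# A text-to-audio pipeline backed by a MusicGen checkpoint;
# "facebook/musicgen-small" is one openly available option.
pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

# Generate a short clip from a text description.
music = pipe("a classical piano theme with a mariachi-infused twist")

# The pipeline returns a waveform and its sampling rate.
scipy.io.wavfile.write(
    "musicgen_clip.wav", rate=music["sampling_rate"], data=music["audio"]
)
```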

What kinds of things can we do with ML and audio? The two most common tasks are transcribing speech to text (automatic speech recognition, or ASR) and generating speech from text (text to speech). In ASR, a model receives as input audio of one or more people speaking and outputs the corresponding text. Some models capture additional information, such as which person is speaking or when each utterance begins and ends. ASR systems are widely used, from virtual speech assistants to caption generators. Thanks to the many open-access models released in recent years, there has been exciting research on multilingual ASR and on running models directly on edge devices.
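To make this concrete, here is a minimal ASR sketch using the same pipeline API. The Whisper checkpoint and the audio filename are assumptions chosen for illustration; any ASR checkpoint and local recording would do:

```python
from transformers import pipeline

# An ASR pipeline; "openai/whisper-small" is one open-access checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local recording (the filename here is hypothetical).
result = asr("conference_talk.wav", return_timestamps=True)

print(result["text"])    # the full transcription
print(result["chunks"])  # text segments with start/end timestamps
```

Passing return_timestamps=True is what surfaces the extra timing information mentioned above: each chunk pairs a span of text with the times it was spoken.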

In text to speech (TTS), a model generates synthetic and, hopefully, realistic speech from a piece of input text.
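A TTS sketch looks much like the generation examples above. The Bark checkpoint and the output filename are again illustrative assumptions:

```python
from transformers import pipeline
import scipy.io.wavfile

# A text-to-speech pipeline; "suno/bark-small" is one open checkpoint.
tts = pipeline("text-to-speech", model="suno/bark-small")

speech = tts("Generative audio opens up exciting applications.")

# As with MusicGen, the output holds a waveform plus its sampling rate.
scipy.io.wavfile.write(
    "tts_sample.wav",
    rate=speech["sampling_rate"],
    data=speech["audio"].squeeze(),
)
```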
