Chapter 1. The State of Video
I’ve been working in online video for about 25 years now, most of that time on the engineering side of content creation software. I’ve witnessed a breathtaking evolution from grainy, postage-stamp-sized experiments to a deluge of high-definition content that consumes an ever-increasing share of people’s lives, both private and professional. But it’s striking to me how video remains fundamentally rigid as a medium.
There are three separate spheres of digital video, and the long-standing status quo is that these three spheres only interact to the minimum extent necessary:
- Production
Here is the realm of the creatives—everyone from marketing video freelancers to YouTube vloggers to Oscar-winning film editors. Their tools are digital but not cloud native. As a result, the creative process is like a busy factory on an island that ships out a single, tightly packed parcel of finished content to the next sphere.
- Video distribution
This is a massive industry encompassing everything from TikTok to Netflix and consuming the lion’s share of the entire internet’s bandwidth today. It’s operated by cloud specialists who have almost nothing to do with video production specialists. The strict separation between the two spheres effectively ensures that video in distribution almost never gets creatively altered—at most, ads are inserted (usually in ways that annoy the viewer because ad platforms know very little about the actual content of the videos).
- Real-time video communications
This means everything from Zoom-style video calls to webinars to interactive live streaming. I’m going to talk a lot more about this sphere presently because, although these meetings and live streams can seem mundane, they’re actually a cloud native form of video production.
This report is really about breaking down the boundaries between the three spheres and how the latest breakthroughs in AI can help with that. Could video be something richer and smarter than today’s fixed formats, which at best get transcoded for delivery? Could we make something creative or useful out of all the video data that flows through real-time video communications systems? Can editing methods escape the production island and be automated in the cloud?
Why Video Should Become Fluid
Another way to state this goal is that we want to turn video into a fluid medium. Computers are good at this. Fifty years ago text and forms were the stiff substance of office paperwork and dusty archives, but now language and user interfaces can be endlessly generated and modified on the fly. The same can happen to video.
It’s not easy, because video is traditionally the most difficult type of digital media to process. And without processing, its potential is often wasted. Live streams and video meetings remain aimless, the recordings too tedious to watch. Refining rich and meaningful content out of this mother lode of daily video has been practically impossible because it required a slow, manual process: video editors and designers toiling away on high-end computers. Until now…
Generative AI is revolutionizing video processing too, but not in the way you might think. Flashy and expensive AI outputs grab the headlines, but improving the types of video we interact with every day often requires a different approach. You probably don’t need a live Marvel character rendered into your meetings. But everyone could use automatic tools that give your videos more context, smart editing, useful information density, and higher production values.
Why Real-Time Video Is Interesting
Let’s look more closely at the definition of real-time video, the third sphere on my previous list.
Often a line is drawn between “streaming” and “video calls.” The most obvious difference is in audience size and access: video calls are private or shared within small groups, while streaming means broadcasting to a wider audience. Streaming can be further divided into two categories:
- Video-on-demand (VOD)
This is the original wave of online video: YouTube, Netflix, and of course endless amounts of more niche content.
- Live streaming
This includes large-audience broadcasts like YouTube Live and Twitch, but also private streams like industrial camera feeds.
From an implementation perspective, VOD is fundamentally different from live streaming because VOD is offline content distribution. There’s no time pressure in preparing video files for VOD. It’s possible to make multiple encodings so that each device can be served an optimized version of the video content, and these files can be cached on servers around the globe for fast access. The greatest engineering challenges in VOD are primarily about cloud content distribution and are thus mostly not specific to video.
In contrast, live streaming is all about optimizing latency—in other words, the time elapsed between when the streamer sends a frame of video and when the viewers receive it. Lower latency means a better experience. But a somewhat surprising feature of modern live streaming is that latency has trended upward, even though you might expect the opposite. When you stream to a platform like YouTube Live, the latency can be anywhere from 15 to 60 seconds. The delay also varies between clients, so your viewers aren’t necessarily synchronized with one another.
Note
The high latency is because live streaming platforms use VOD-style techniques to achieve the “last mile” of content distribution to potentially millions of simultaneous viewers. For more detail, see Mux’s description of its setup.
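To get a feel for where those 15 to 60 seconds come from, here is a rough back-of-the-envelope sketch in Python. The segment duration, buffer depth, and overhead figures are illustrative assumptions, not measurements from any particular platform.

```python
# Rough estimate of minimum glass-to-glass latency for segmented
# (HLS/DASH-style) live delivery. All numbers are illustrative.

def estimated_latency_seconds(segment_duration: float,
                              player_buffer_segments: int,
                              encode_and_package: float = 2.0,
                              cdn_propagation: float = 1.0) -> float:
    """Sum the major contributors to live-stream delay.

    A segment can only be published once it has been fully recorded,
    and most players buffer a few segments before starting playback.
    """
    segment_delay = segment_duration                           # wait for the current segment to finish
    buffer_delay = player_buffer_segments * segment_duration   # player-side buffering
    return segment_delay + buffer_delay + encode_and_package + cdn_propagation

# Hypothetical but typical-looking values: 6-second segments, 3 segments buffered.
print(estimated_latency_seconds(6.0, 3))   # -> 27.0 seconds
```

Segment length is the dominant term, which is why low-latency streaming efforts focus on shrinking or partially delivering segments.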
The crucial point about real-time video is that latency matters. For VOD, that’s not the case. But for live streaming and video calls, it’s an essential parameter of the user experience, on par with picture and sound quality.
The level of interactivity defines the tolerance for latency. It’s practically impossible to carry a voice conversation if the call latency is more than a fraction of a second, whereas watching a live Beyoncé concert doesn’t feel seriously “off” even if the stream has a 30-second delay. However, the latency may become apparent and distracting if, say, someone physically in the audience is texting you about the concert. In this way, out-of-band interactivity increases the demand for lower latency.
Real time has a much more stringent definition in computer science. But for our purposes, real time is not a technical feature but a user experience feature—it’s defined only by the perception of people occupying the same time via video, not any specific latency boundary. As the previous concert example illustrates, that perception can shift.
Seen in this context, a Twitch live stream and a Zoom conference call are not very different animals—they just make different trade-offs on how to optimize between audience size, image quality, and latency. Both are actually real-time video productions at their essence.
What Holds the Content Together
Real-time video is bound in time and space by a concept whose name varies depending on the application and vertical. It can be a meeting, a video call, a live stream, a webinar, or an online event. Some companies even try to come up with their own brandable name, often with less satisfactory results (for instance, Slack calls it a “huddle”). But we’ll settle for “video session” as the most generic option.
Remember in the previous section when I defined the real-time online video experience as being based on the perception of shared time? In this framework, the session is what gives a set of participants access to that shared context of perceived time.
Again, this is based on user experience rather than a technical definition. We don’t need to make specific claims about latency, persistent network connections, or any other implementation feature. If it feels real-time and connects people via video, it’s a session.
At its simplest, a session is one-to-one between two participants, which is just a video call. Even this deceptively ordinary experience offers many opportunities for product evolution beyond “telephone call but with picture.” For example, in a telehealth application, sessions may be augmented with various data views: laboratory results displayed to the patient, patient records shown to the doctor, and so on. AI can take it further by analyzing and enhancing images during the call, providing summaries on the fly, and doing all sorts of processing to help the doctor after the session has ended.
In this example, we already saw a major feature of real-time video sessions: the presence of application-specific participant roles, namely doctor/patient. Another familiar example would be teacher/student. But someone designing an education application will probably have to consider more roles than just those two. In addition to one teacher and several students participating live, there could also be facilitators, guest lecturers, smaller workgroups, and passive viewers.
Participant roles as a product feature are often intertwined with the essential question of high versus low latency. A participant on a high-latency connection will have much more limited opportunities for interaction, and a viewer of a recording, of course, has none. It’s a core challenge for product designers working in this space to ensure a good user experience for all these roles and their interaction modes.
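To make the idea of sessions, roles, and latency tiers a bit more concrete, here is a minimal data-model sketch in Python. The class and field names are hypothetical illustrations, not any particular SDK’s API, and the latency tiers are rough buckets rather than hard thresholds.

```python
from dataclasses import dataclass, field
from enum import Enum

class LatencyTier(Enum):
    """Rough interaction tiers; the boundaries are illustrative, not standardized."""
    REAL_TIME = "sub-second, full two-way interaction"
    LOW_LATENCY = "a few seconds, limited interaction (chat, reactions)"
    BROADCAST = "tens of seconds, watch-only"
    RECORDING = "viewed after the fact, no interaction"

@dataclass
class Participant:
    name: str
    role: str                 # application-specific: "teacher", "student", "facilitator", ...
    tier: LatencyTier

@dataclass
class Session:
    """The shared time context that binds a set of participants together."""
    session_id: str
    participants: list[Participant] = field(default_factory=list)

    def can_interact(self, p: Participant) -> bool:
        # Only participants on a real-time or low-latency connection
        # can meaningfully take part in the live interaction.
        return p.tier in (LatencyTier.REAL_TIME, LatencyTier.LOW_LATENCY)

# Example: an online lecture with mixed roles and latency tiers.
lecture = Session("algebra-101", [
    Participant("Prof. Lee", "teacher", LatencyTier.REAL_TIME),
    Participant("Sam", "student", LatencyTier.REAL_TIME),
    Participant("Overflow room", "passive viewer", LatencyTier.BROADCAST),
])
print([p.name for p in lecture.participants if lecture.can_interact(p)])
```

The point of separating role from latency tier is that a product designer has to handle every combination of the two, not just the happy path of everyone being live and interactive.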
Note
There is an emerging product category of interactive live streaming (ILS), which rearranges the balance to provide live streaming-style broadcast experiences with the low latency of video calls. If your product or use case fits into this bracket, it’s probably also a good candidate for AI-enabled postproduction techniques.
Rich interaction within a session makes it better for the participants themselves, but it’s also interesting for the purposes of this report because it provides structured data for an AI to chew on.
What You Could Be Producing
The real-time video session’s core strength is also its fundamental limitation: it exists in time and evolves in real time. When it ends, what’s left but memories? If we want more, we need to produce something that lasts—call them artifacts.
Artifacts are the nontransient outputs derived from a session. I realize that sounds perhaps suspiciously theoretical, so let’s look at some concrete examples of what kind of useful artifacts can exist across products and user groups:
- Highlights and clips for marketing and social media
It’s easy to create raw video content today. Simply turn on a camera and start talking! Creating polished, edited content is much harder. Then you realize that platforms and audiences are so different that you’re going to need various formats (landscape versus portrait), lengths, editing styles, and even different topic selections. Automation is badly needed to ease the process of turning raw video content into useful clips, highlights, and edits (see the sketch after this list).
- Shared knowledge and context in the workplace
“So, what was actually said at last week’s project meeting? Did anyone take notes? What were the VP’s slides about? I guess there’s a recording of the call somewhere…but it was over 90 minutes long. If only someone could produce a summary to share with the team. And include the VP’s slides. And edit the video recording down to a five-minute clip that everyone could watch before the next meeting. That way we’d easily save twenty minutes on that call and a lot of frustration. Also, should we add some graphics so it’s easy to skim through the video?…”
- Teaching
So much learning happens today via video content, but most teachers don’t have access to professional video production teams. The immediacy of live teaching doesn’t easily come across in straight-up recordings. Automatic editing can expand the audience and make it much easier for students to access the material on their own.
Other possibilities include:
- Personalization
- Smart ad insertion
- Content moderation and monitoring
- Capturing software product usage for design feedback
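To make the highlights-and-clips example above more concrete, here is a minimal sketch of what a machine-readable clip specification might look like. The schema, field names, and values are hypothetical, meant only to show how one session can fan out into several platform-specific artifacts.

```python
from dataclasses import dataclass

@dataclass
class ClipSpec:
    """One derived artifact: a clip cut from a session recording."""
    source_session: str
    start_seconds: float
    end_seconds: float
    aspect_ratio: str        # "16:9" for landscape platforms, "9:16" for portrait
    max_duration: float      # platform limit, e.g. short-form vs. long-form
    style: str               # editing style: "highlight reel", "summary", "teaser"

# The same session can fan out into several platform-specific artifacts.
clips = [
    ClipSpec("q3-townhall", 1240.0, 1295.0, "9:16", 60.0, "teaser"),
    ClipSpec("q3-townhall", 310.0, 640.0, "16:9", 600.0, "summary"),
]
```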
Although these ideas are quite different in their execution and target audience, they have something in common: creating the proposed artifacts is labor-intensive and often calls for both design skills and domain knowledge. You might need someone who can edit video, create decent graphics, and understand marketing segments, product positioning, partnered content, and so on. This human expert just doesn’t scale.
If these artifacts are to be produced, they need to be industrial, not artisanal. As it turns out, today’s AI is already (often—not always!) quite capable of expressing human-derived creativity as a purely digital, industrial solution. Chapter 2 will discuss AI and how it can be applied to real-time video.