When technology unleashes creativity: the renaissance of the digital spoken word.

Commercial radio broadcasting began a hundred years ago. A lot has changed since then. The BBC was born in 1922, but just two decades later video took off and became the biggest mass medium. My generation, Gen X, grew up obsessed with TV. “Video Killed the Radio Star” (1979) used to be a generational anthem. Then came Gen Y, with online video and the post-TV world.

Today, people still spend more time looking at screens than listening to radio, podcasts or other audio content.

But things are moving fast and in unexpected directions. First, video did not kill radio. Radio, like video, became digital, mobile and (artificially) intelligent.

Voice and audio are mobile (the smartphone is now the main gateway to multimedia), interactive (Siri & Co.) and social (Clubhouse & Co.).

In 2020, people listened to spoken-word content more than ever, getting relief, education, information and entertainment:

With the fall in in-person interactions and blurrier lines among work, home and alone time, record numbers of people have turned to non-music audio content during the past year. It can be a welcome mental health break, an attempt to escape the monotony of pandemic life, or a way to socialize remotely.

The Washington Post, 15 March 2021

Media and entertainment are embracing voice technology faster than other industries 💪. According to recent research by Speechmatics, media companies and publishers have implemented almost a quarter of all voice technology applications (The 2021 Voice Predictions Report).

From high-end solutions such as automated subtitling and captioning of TV content, to individual creators launching their voice chats, there are plenty of examples of valuable spoken-word media experiences.

The possibilities are endless. To see them, you need to ignore the journalistic hype about the latest darlings, like Clubhouse, and your personal bubble, too. This post takes a step back: with some maps, it outlines how technology drives voice-led editorial experiences, and delivers a showcase of inspiring examples.


Section I – Media is technology. A bird’s-eye view.

When analysing the evolution of media categories, we must remember that MEDIA IS TECHNOLOGY. Technology creates the conditions to develop new media categories, and it determines their business models at large.

Still, you often hear that technology is a means to express existing media rather than the media itself. That explains, for instance, the confusion that arises when we ask what television is and what it is not in the age of over-the-top (OTT) streaming and YouTube.

What about the digital spoken word (that is, non-music audio content)? First, let’s look at the overall evolution of voice technology and its trajectories:

1. The “AUDIO GOES EVERYWHERE” trajectory is the most intuitive: we have all experienced the transition from CDs to iTunes to streaming on the go.

Like music, the spoken word has become a pervasive, on-demand and/or live experience that we access primarily from our smartphones.

Other media types took the same trajectory, with one difference: the intimacy of audio and its flexibility. We can listen AND do something else, which gives audio an advantage that is still unparalleled. We are recovering from a video hangover (who has not suffered from “Zoom fatigue”?) and feel the need for the subtler experience of listening to voice.

2. The “ALL CAN BECOME CREATORS” trajectory follows a similar pattern to video, with some distinctions.

Mobile phone makers and big tech platforms have pushed consumers to “fill” their business with free content to create network effects, mostly through visuals: video and immersive “stories”. Until recently, these companies did not devote the same attention and effort to the spoken word.

Unnoticed for a long time, but steady, the advance of podcasts showed the new opportunity. In the meantime, the emerging paradigm of the creator economy filled the gap left by big tech, with tools and solutions made for creators.

As a result, many voice tools are available without the need to join a Big Tech platform: you can get automatic transcriptions of interviews and create your podcasts outside of Apple and Co. You can access the same tools used by publishers, at a fraction of the enterprise cost. One of the best examples: descript.com, founded in 2017 and VC-backed, is an audio/video editing tool built on the idea of making editing as easy as cutting and pasting in a Word document.

However, the pressure from platforms to own this space is high. When Spotify launched Anchor.fm in autumn 2020, the game changed. The spoken word is a new battlefield for platforms that want to establish themselves as the home for both professional products (commissioning originals from publishers) and new prosumer talent (the next web stars).

3. The trajectories of ARTIFICIAL INTELLIGENCE – text to speech and speech recognition – are the most disruptive, not only for media and publishing.

What matters here: we have reached the tipping point where complex technologies are available to all, not only to enterprises. Small publishers and individual creators can tap into solutions that were unaffordable until recently.

Technology-savvy publishers have in-sourced the talent and the tools to develop their own proprietary solutions (examples based in the EU: Axel Springer in Germany, the Schibsted Group in Sweden).

Section II – A quick dive into neural text-to-speech and voice recognition. The opportunity for publishers.

The case of NTTS, Neural Text-to-Speech, is exemplary. Amazon launched Polly, a Neural TTS solution, in November 2016. As of July 2021, it includes 60 voices in 29 languages, at a price that would allow converting “The Adventures of Huckleberry Finn” by Mark Twain into a spoken audiobook for less than 9 euros.
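To make the arithmetic behind such an estimate concrete, here is a minimal sketch. Text-to-speech services of this kind typically bill per character, so the cost is simply the text length times the unit price. The character count and the price per million characters used below are illustrative assumptions, not figures from this post; check the provider's current price list before relying on them.

```python
def synthesis_cost(num_chars: int, price_per_million: float) -> float:
    """Cost of converting num_chars characters of text into speech."""
    return num_chars / 1_000_000 * price_per_million


# Assumed, illustrative inputs (not taken from the article):
BOOK_CHARS = 600_000      # rough length of a long novel, in characters
NEURAL_PRICE_USD = 16.0   # assumed price per 1 million characters

cost = synthesis_cost(BOOK_CHARS, NEURAL_PRICE_USD)
print(f"Estimated cost: {cost:.2f} USD")
```

With these assumed inputs the estimate lands just under 10 US dollars, the same order of magnitude as the figure quoted above.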

Amazon Polly uses NTTS technologies based on deep learning to generate synthesized speech of realistic quality. It learns to speak the way children do: the engine iteratively improves its speech quality by understanding the differences in speaking styles. Amazon’s Polly Newscaster Style makes narration sound almost real, modulating its voice depending on whether it is reading a newscast, a sportscast, or a college lecture.
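For the technically curious, here is a minimal sketch of how a speaking style can be requested. Polly accepts SSML markup, and the newscaster style is selected with the `amazon:domain` tag. The helper name `newscaster_ssml` and the sample sentence are hypothetical, and the commented-out call at the end assumes that `boto3` is installed and AWS credentials are configured.

```python
from xml.sax.saxutils import escape


def newscaster_ssml(text: str) -> str:
    """Wrap plain text in SSML that requests the newscaster speaking style."""
    # escape() protects characters such as & and < that would break the markup.
    return f'<speak><amazon:domain name="news">{escape(text)}</amazon:domain></speak>'


ssml = newscaster_ssml("Markets rose today as tech & media stocks rallied.")
print(ssml)

# With AWS credentials configured, the string could then be sent to Polly:
#   import boto3
#   boto3.client("polly").synthesize_speech(
#       Engine="neural", VoiceId="Joanna", TextType="ssml",
#       Text=ssml, OutputFormat="mp3")
```

Keeping the markup step separate from the network call makes it easy to preview and test the text before paying for synthesis.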

As a result, the voices available can adopt different speech styles depending on what output you need. You can find an excellent technical explainer at this link (in German). Whether you like Amazon or not, the appeal of a service that includes custom lexicons for product names, company jargon or foreign words in the desired form, plus ad hoc voices (upon request) exclusive to your organization, is unrivalled at the moment.

There are many potential use cases. Amazon Polly and its competitors are paving the way not only for the massive development of audiobooks (still the biggest market for the digital spoken word, far bigger than podcasts), but for an entirely new category of speech products, starting with educational publishing.

What about Google? Fewer voices, only English for the moment, and a focus on the massive conversion of e-books into audiobooks. Currently, Google markets its automated audio narration as a solution for publishers (see the Google presentation here). Google started a beta phase in mid-2020 with selected publishers in the USA and the UK.

Voice recognition and voice-controlled devices.

Keyboard, mouse, touchpad: are they all due to be replaced by the voice? Will voice become the new universal interface between humans and machines? The ultimate destination is not clear yet.

We are transitioning from a world controlled by clicking, pushing, typing and swiping to an ecosystem where we can say – to Siri, Google, Alexa & Co. – what we want, what we look for, what we need to know.

The pandemic has acted as an accelerator here, too. Social distancing and the need to avoid commonly touched surfaces have given voice technology new momentum, encouraging the adoption of touch-free controls. Data confirm that people engaged much more with their smart devices at home during 2020.

It is a natural process: after all, voice is the most comfortable interface for humans, and smart speakers have reached 95% word recognition accuracy (source: Google).
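A common way to quantify a figure like this is the word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the recognized text into the reference text, divided by the length of the reference. Word accuracy is then 1 minus WER. The sketch below is a generic illustration; whether the figure cited above was computed exactly this way is an assumption, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one wrong word out of five.
wer = word_error_rate("play the latest news podcast", "play the latest new podcast")
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
```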

However, publishers and media have not yet fully exploited assistive voice technology and smart speakers. Things are changing:

  • On one side, optimizing editorial content for voice search can become a secret weapon for publishers to reach new audiences.
  • On the other side, content atomization applied to spoken-word content opens endless possibilities to develop interactive experiences – “ask and get”, “search and find” etc. (see a couple of examples later in the showcase).
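As a rough illustration of the second point, content atomization can be as simple as storing a transcript as small, time-stamped segments that can be searched individually. The following is a minimal sketch under that assumption; the `Segment` class, the `find_segments` function and the sample data are all hypothetical, and a real product would use proper search or language understanding rather than plain keyword matching.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: float  # where the segment begins in the audio file
    text: str             # transcript of the segment


def find_segments(segments: list[Segment], query: str) -> list[Segment]:
    """Return the segments whose text contains every word of the query."""
    terms = query.lower().split()
    return [s for s in segments if all(t in s.text.lower() for t in terms)]


# Invented sample episode, split into three "atoms".
episode = [
    Segment(0.0, "Welcome to the morning briefing."),
    Segment(12.5, "The central bank raised interest rates again."),
    Segment(47.0, "In sports, the local team won the cup final."),
]

for hit in find_segments(episode, "interest rates"):
    print(f"Jump to {hit.start_seconds}s: {hit.text}")
```

A voice assistant could use the returned start time to play only the relevant part of the episode, which is the essence of an “ask and get” experience.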
