In Kubrick’s science fiction epic, 2001: A Space Odyssey, we are introduced to the HAL 9000, an onboard computer designed to assist the Discovery spacecraft astronauts and control the ship’s systems. We get to know the anthropomorphic machine, nicknamed “HAL”, through his voice, the means by which he interacts with the ship’s crew. Among the first serious depictions of synthetic speech, HAL offers a glimpse into how such tech might one day be realized.
Through his dulcet tones, HAL emerges as a full-on character, complete with an arc—a journey in artificial intelligence. In fact, HAL grows a tad too intelligent, if that’s what we’d call neurotic and self-preserving, and things go badly (very badly) for the humans in his care. But setting aside homicidal computers, what may we say of the voice, itself? How has synth-speech fared in its own odyssey—is it a satisfying imitation of the real thing, and how might we use this technology in our learning products?
By the early 80s, commercial examples were commonplace. Products like Texas Instruments’ Speak & Spell, Mattel’s Intellivision, and early home computers were among the first consumer-geared showcases of the technology. You may recall the NORAD computer from the film WarGames, which intoned, “Shall we play a game?” If you can still hear the famous phrase in your head, you’ll note the grainy machine-monotone that typified early iterations, none of them convincing simulations of the real thing.
But innovation marches on, and we’ve seen sampled speech come to the fore, shedding the skin of its robotic forebears. Sampled speech is built from recordings of actual voice artists—bits of audio that are concatenated by the computer into words and phrases to more naturalistic effect. If the old stuff was pure robot, this new breed is cyborg. It sounds better than purely machine-generated voices, and in small doses can be convincing. In fact, some of the commercial databases of sampled voices are quite robust, enriched with myriad recorded takes of words and phrases. This offers the computer more choices and edges us closer to emotive realism.
But while the state of the art has advanced, our ears aren’t fooled. Synth-speech in its best form falls short of the warmth of real speech. Our voices possess all those lilts and pauses and tics that are the expression of the human mind. As far as the technology has come, we still detect the machine behind the mouth, that digital drawl, coloring every phrase it utters. Meanwhile, back at the lab, the wizards are working away, eking out gains. But is synth-speech ready for prime time? And perfection aside, is it good enough to replace the human voice in at least some narrated products? Let’s look at some production realities.
Text-to-Speech Tools
It sounds like a producer’s dream come true. Why hire voice talent when you can generate spoken-word audio yourself, with a script and text-to-speech software? From a cost standpoint, it’s enticing. As we get into production, however, issues abound. Topping the list are botched diction and mispronounced words. The longer the script, the bigger the headache. It becomes a game of rewriting words, reducing them to their phonetic equivalents and using tricks to correct misplaced emphasis. There’s labor in this, and you never arrive at a fully satisfying result. Your best hope is to eliminate obvious errors and achieve a less-awkward outcome.
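For readers who script their production, the word-rewriting chore described above can be sketched roughly as follows. This is only an illustration: the `RESPELLINGS` table and the `respell` function are hypothetical names, and the respellings themselves are made-up examples, not entries from any particular text-to-speech engine.

```python
import re

# Hypothetical respelling table: words an engine tends to mispronounce,
# mapped to phonetic spellings. These entries are illustrative assumptions.
RESPELLINGS = {
    "cache": "cash",
    "SQL": "sequel",
    "niche": "neesh",
}


def respell(script: str, respellings: dict[str, str] = RESPELLINGS) -> str:
    """Replace whole-word matches (case-insensitive) with phonetic respellings."""
    if not respellings:
        return script
    # Longest keys first, so a longer entry wins over a shorter one it contains.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in respellings.items()}
    # The replacement is inserted as written in the table; original casing is not kept.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], script)


if __name__ == "__main__":
    print(respell("Clear the cache before running SQL."))
```

A table like this only handles pronunciation. Misplaced emphasis still calls for hand edits and a listen-through, which is where most of the labor goes.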
Let’s also consider our audience, what it takes to listen to machine-talk. The longer the audio, the greater the fatigue as we mentally correct content. Even the best text-to-speech engines require us to maintain two mental tracks running in parallel. One consists of the audio to which we are listening. The second is a series of instant replays we perform to rephrase the content—the brain’s equivalent of smoothing jagged pills before swallowing. This adds cognitive load, stealing from comprehension. Nevertheless, I routinely listen to text-to-speech audio, especially via the Google TTS engine that can power apps like Pocket for Android. It’s good enough for article-length content, as long as I’m willing to untie verbal knots along the way.
As a producer and consumer, I’ve experienced both sides of the equation, and this has helped me navigate the tradeoffs. I’ve produced voice work for e-learning, documentaries, audiobooks, and podcasts, working in the worlds of professional voice talent and also with what synth-speech has to offer. But I’ve only dared to use synth-speech in products a few times, and only when I intended it for a stylized effect (because there’s no hiding that it’s synthetic) and in non-fatiguing doses. I’ve mostly used text-to-speech as a placeholder or a pricing tool. On several occasions, I’ve rendered a script with text-to-speech software to help me estimate run-time, giving me an idea of what professional voice talent should cost when we go to record.
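The pricing trick lends itself to a little arithmetic. Here is a minimal sketch, assuming the text-to-speech render was exported as an uncompressed WAV file and that talent quotes a rate per finished minute. Both are assumptions on my part, and the function names and the rate in the example are hypothetical placeholders, not market figures.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the run-time of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def estimate_talent_cost(duration_seconds: float,
                         rate_per_finished_minute: float) -> float:
    """Rough cost estimate: run-time in minutes times a per-minute rate.

    The per-finished-minute pricing model is an assumption; talent may
    instead quote per word, per hour, or per project.
    """
    return round(duration_seconds / 60 * rate_per_finished_minute, 2)


if __name__ == "__main__":
    # Hypothetical numbers: a 90-second render at 20.00 per finished minute.
    print(estimate_talent_cost(90, 20.0))
```

Treat the result as a ballpark only. A human narrator will pace the script differently from the machine, so the synthetic run-time is a starting point for the conversation rather than a quote.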
As you approach your own make/buy decision, I’d remind you to consider what you are asking of your audience. I’m all for “good enough”, but given those fatigue factors, I can only justify long-form text-to-speech as some kind of austerity move. If my budget would only allow for synthetic narration, I’d find a way to redesign the product so as not to use it at all. That’s the bad news.
The good news is that the voice market is more affordable than you may realize. Right now, you can put your project up for bid at several voice talent portals, solicit auditions, and expect rapid turnaround terms for finished product. In the scheme of the typical e-learning project, the cost can be nominal, so I usually advise against stripping out a narration requirement as the first line of cost-cutting. Voice talent doesn’t have to break the bank, and in an age when electronic learning can be, well, a little too electronic, your users will thank you for the human touch.
My parting advice is that you keep an ear on the technology and try it for yourself. If not as narration in your e-learning, then perhaps in reading the latest learning article to you. For starters, you might load up this article and let your phone do the talking.
Anthony Rotolo manages e-learning programs for the award-winning Defense Acquisition University.