Idealog Columns: SPEECHLESS

Dick Pountain /Idealog 325/ 05 Aug 2021 02:1

I was casually watching a Hawaiian volcano erupt on YouTube, as you do, when I felt something slightly creepy in the narration that I couldn’t quite identify. It sounded like an adult American male, but with something very subtly wrong about its rhythm. I started noticing the same in other US videos, and posted on Facebook to ask whether anyone else thought synthetic digital voices were being used: the consensus was probably not. Then last month MIT Technology Review published an article about AI voice actors (www.technologyreview.com/2021/07/09/1028140/ai-voice-actors-sound-human) which said that “deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech.” It’s now possible to sample the voice of a human actor, or of someone in your firm, then have a company build and rent to you a synthesiser that speaks your PR materials so well that listeners can’t tell it from the real person.

I’ve always had an inexplicable interest in voice synthesis. Most people nowadays regard computers as visual devices, but to me making them speak is just as interesting as drawing pictures on them. The first halfway decent text-to-speech (TTS) program I got was back in Windows 3.1 days - called Monologue, it came bundled with my first Sound Blaster card. Monologue had a raw, Stephen Hawking-like delivery, but it did support a simple syntax for marking up texts to add some degree of expression, and I amused myself getting it to read poetry, including this poem (https://soundcloud.com/dick-pountain/the-primal-proof) that Felix Dennis had dedicated to me. Over the next few years I kept in touch with the state of the text-to-speech and voice-recognition art, particularly via the ground-breaking work of the Belgian company Lernout & Hauspie, which I mentioned in this column in 1999. During the 13 years I spent living part-time in Italy I keenly followed the progress being made by Google with its voice and translation engines, and by the 2000-teens it was becoming possible for me to use an Android phone like Star Trek’s universal translator when I needed to extend my feeble vocabulary: I would type what I wanted to say into Google Translate, have it spoken to me in Italian and practise it before going into, say, a police station or hardware store. I didn’t quite have the nerve to hold up the phone to speak for me...

By this time the field was splitting between cloud-based and local ‘edge’-based software: cloud voice services were becoming convincing enough to be used in those scams that the MIT Technology Review article mentioned. I’m not so much interested in those as in the cruder TTS programs that one can get for free to run on a phone or Chromebook. One in particular, called Vocality, tickled my fancy. A simple interface onto Google Speech services, it offers control over speed and pitch plus a large selection of national voices. For example it lets me create comical action-movie-villain dialogues by choosing, say, a Russian or Albanian voice and setting the pitch ridiculously low. I also discovered that by typing in strings of random characters and setting the speed high I could generate something resembling ‘mouth-music’, as in this catchy little ditty (https://soundcloud.com/dick-pountain/tuvan-gruv). Politically incorrect perhaps, but fun.

Before writing this column I checked out the current state of local TTS apps and found dozens of free ones that are massively improved: for example Balabolka, NaturalReader, Panopreter, TTSReader and WordTalk offer good-quality speech and even customisable voices.

But it’s in the cloud-based arena that things get scary. Nuance is a typical company offering to “deliver a human‑like, engaging, and personalized user experience. Enhance any customer self‑service application with high‑quality audio tailored to your brand.” Or there’s Amazon, which offers developers the Polly API so they can add Alexa-like abilities to their products. For movie professionals Respeecher, whose technology has been used by Lucasfilm, offers to “create speech that's indistinguishable from the original speaker. Perfect for dubbing an actor's voice in post production, bringing back the voice of an actor who passed away, and other content creators' problems.” But it’s Amai (https://amai.io/) who really spell it out: “Sorry, voice actors, we will replace you soon […] this text is painted with the Love emotion. You can highlight any text, choose any emotion and listen to how it sounds, for example this phrase is pronounced with the Happiness.” Go to their site to hear the perky result.

Thursday, 3 March 2022
