Sunday, 1 July 2012

SPEAKING IN TONGUES

Dick Pountain/21 April 1999/Idealog 57
                 
It's increasingly rare for me to have my mind blown by technology, but blown it certainly was last week. The occasion was a symposium on speech technologies at Sussex University sponsored by the Belgian firm Lernout & Hauspie. L&H is the almost undisputed leader in speech technologies, for which reason Microsoft and Intel have both bought pieces of the company. The demonstration that blew my mind didn't look particularly revolutionary. On the stage sat two young women operating notebook computers connected by a LAN, with both their screens projected overhead. It didn't hurt that they were both pretty, but more significant was that one woman was Spanish and the other German, and both spoke English with quite noticeable accents. The Spanish woman spoke into her headset mike, some anodyne business message about visiting Germany and fixing a meeting next week. After a few seconds' pause her words appeared on her notebook screen, 100% accurately recognized. After a few seconds more her words appeared on the other woman's screen, translated into German, and then the second notebook spoke the message aloud in German, in a smooth, inflected and recognizably female voice rather than in Hawking/Dalek speak.

Afterwards L&H boss Jo Lernout explained to us the three main technologies involved. The next generation of L&H's VoiceExpress speech recognition package has vastly reduced training time (each woman recalibrated it by counting up to 8 before the demo). L&H's PowerTranslator text-to-text translation software achieved what even my poor German could see was a reasonable translation (albeit of a simple and unambiguous business sentence). Finally there was L&H's RealSpeak text-to-speech engine, which uses new phoneme concatenation algorithms.
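For readers who like to see the plumbing, here is a minimal sketch of how such a speech-to-speech pipeline hangs together: recognition, then translation, then synthesis. The function names are hypothetical placeholders invented for illustration, not L&H's actual APIs, and each stage is just a stub for whichever engine you plug in.

    # Sketch of the demo's three-stage pipeline: speech recognition,
    # text translation, then text-to-speech. All names here are
    # hypothetical placeholders, not L&H's real product interfaces.

    def recognize_speech(audio_bytes: bytes, language: str) -> str:
        # Stand-in for a recogniser such as VoiceExpress: audio in, text out.
        raise NotImplementedError("plug in a real speech recogniser here")

    def translate_text(text: str, source: str, target: str) -> str:
        # Stand-in for a translator such as PowerTranslator: text in, text out.
        raise NotImplementedError("plug in a real translation engine here")

    def synthesize_speech(text: str, language: str) -> bytes:
        # Stand-in for a TTS engine such as RealSpeak: text in, audio out.
        raise NotImplementedError("plug in a real TTS engine here")

    def speak_across_languages(audio_bytes: bytes, source: str, target: str) -> bytes:
        # The whole demo is just these three stages chained in sequence.
        text = recognize_speech(audio_bytes, language=source)
        translated = translate_text(text, source=source, target=target)
        return synthesize_speech(translated, language=target)

What strikes me about the demo is how little glue is needed: each stage trades in plain text, which is what lets three separately developed engines be chained into one seamless performance.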

Hawking-era TTS synthesisers employ formant synthesis, in which a software model of the human vocal tract is used to generate phonemes one at a time. The advantages are flexibility, since by adding suitable rules the same engine can speak any human language, and low memory cost. The downside is that one-phoneme-at-a-time gives that flat, Dalek intonation. Concatenation algorithms instead use a library of snippets of recorded human speech, splicing them together as necessary. L&H's version of concatenation stores not just single phonemes but diphones (i.e. pairs of consecutive phonemes, like the 'gz' sound in the middle of 'example'), triphones, tetraphones and even whole words. It uses a lot of memory - but now we have a lot of memory - and it sounds wonderful. RealSpeak also makes a stab at capturing prosody, the rise and fall in pitch that is so important to perceived meaning in human speech. In fact Lernout demonstrated that they can now capture the intonation of an individual human voice by analysing whole sentences, then insert new content into this prosodic 'envelope'; they can say things that you haven't ever said, in your voice.
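To make the idea concrete, here is a toy sketch of that kind of unit concatenation, assuming an invented library of recorded units: it greedily picks the longest stored unit (whole word, tetraphone, triphone or diphone) matching the next stretch of the phoneme sequence and splices the recordings end to end. A real engine also overlaps units at the joins, smooths them, and imposes a prosodic contour, all of which this deliberately ignores.

    # Toy illustration of concatenative synthesis. The unit inventory and
    # 'audio' samples are invented for illustration only.

    # phoneme-sequence key -> recorded audio samples (placeholder numbers)
    UNIT_LIBRARY = {
        ("eh", "g"): [0.1, 0.2],           # a diphone
        ("g", "z"): [0.3, 0.1],            # the 'gz' diphone in 'example'
        ("g", "z", "ae"): [0.3, 0.2, 0.1], # a triphone, preferred when present
    }

    def synthesize(phonemes):
        samples = []
        i = 0
        while i < len(phonemes):
            # try the longest available unit starting at position i
            for length in range(len(phonemes) - i, 0, -1):
                unit = tuple(phonemes[i:i + length])
                if unit in UNIT_LIBRARY:
                    samples.extend(UNIT_LIBRARY[unit])
                    i += length
                    break
            else:
                i += 1   # no recording covers this phoneme; skip it
        return samples

The trade-off the article describes falls straight out of this structure: the bigger the library of longer units, the fewer splices you need and the more natural the result, at the cost of memory.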

Jo Lernout was quite confident that the means are now within grasp, probably no more than three years away, to build a hand-held box into which you can speak a sentence in, say, English, press a button and have it repeated in your own voice but in German, or Spanish, or Chinese. Another little piece of Star Trek has fallen to earth.

I've always been interested in speech synthesis, ever since writing my own drivers for an ancient Covox synthesiser in Forth nearly 20 years ago. I've also periodically checked up on the current state of speech recognition, but until last week I hadn't been persuaded that it was quite here yet. And I'm still not convinced of the case for speech as the ultimate, natural form of input to a computer - talking to most people is already a chore, so I certainly don't want to have to talk to my bloody computer. Typing, handwriting recognition and voice will always coexist, each being most appropriate for a different situation (try taking notes by speech input in a rush-hour tube train, or generating a 25,000-line C++ program in handwriting).

The blowing of my mind was completed by the speaker who followed Lernout. L&H's VP of Language Services, Florita Mendez, told us that "language remains the last obstacle to globalization". As a Spanish speaker Mendez knows, better than any 100% Anglophone can, that it's just not true that the language war is over and English has won - a belief that life in the computer industry might otherwise tempt you into. The vast majority of people in the world do not speak, and will never speak, English, and many of them don't want to. The preservation of a language, as any Welsh person will tell you, is an important part of preserving a culture.

Following Anthony Giddens' Reith Lecture series, globalization is very much in the news as I write this column, and computers are one of the principal technologies driving the process. It's thanks to computer and telecommunications technology that transnational companies can transfer capital anywhere in the world in seconds, and monitor world stock markets and commodity prices. But here we have a computer technology that might actually tend to preserve, rather than obliterate, cultural differences; computers that can translate for us in real time, embedded into telephones and email systems, have the potential to let us all communicate without having to buy into the whole Anglo-Saxon business culture. On the darker side, this technology also has the potential to sever our last link with objective reality. For years now Photoshop has been capable of painting you out of (or into) the picture, but now they can steal your voice too, and make you say things you never said - your word may not be your bond for much longer.

