Dick Pountain/PC Pro/Idealog 208: 15/11/2011
It can't have escaped regular readers of this column that I'm deeply sceptical about several much-hyped areas of progress in IT. To pick just a couple of random examples, I've never really been very impressed by voice input technologies, and I'm extremely doubtful about the claims of so-called "strong" Artificial Intelligence, which insists that if we keep on making computers run faster and store more, then one day they'll become as smart as we are. As if that doesn't make me sound grouchy enough, I've been a solid Windows and PC user for 25 years and have never owned an Apple product. So surely I'm not remotely interested in Apple's new Siri voice system for the iPhone 4S? Wrong. On the contrary, I think Siri has an extraordinary potential that goes way beyond the purpose Apple purchased it for, which was to impress people's friends in wine bars the world over and thus sell more iPhones. It's not easy to justify such faith at the moment, because it depends upon a single factor - the size of the iPhone's user base - but I'll have a go.
I've been messing around off and on with speech recognition systems since well before the first version of Dragon Dictate, and for many years I tried to keep up with the research papers. I could ramble on about "Hidden Markov Models" and "power cepstrums" ad nauseam, and was quite excited, for a while, by the stuff that the ill-fated Lernout & Hauspie was working on in the late 1990s. But I never developed any real enthusiasm for the end results: I'm somewhat taciturn by nature, so having to talk to a totally dumb computer screen was something akin to torture for me ("up, up, up, left, no left you *!£*ing moron...").
This highlights a crucial problem for all such systems, namely the *content* of speech. It's hard enough to get a computer to recognise exactly which words you're saying, but even once it has, those words won't mean anything to it. Voice recognition is of very little use to ordinary citizens unless it's coupled to natural language understanding, and that's an even harder problem. I've seen plenty of pure voice recognition systems that are extremely effective when given a highly restricted vocabulary of commands - such systems are almost universally employed by the military in warplane and tank cockpits nowadays, and even in some factory machinery. But asking a computer to interpret ordinary human conversations with an unrestricted vocabulary remains a very hard problem indeed.
I've also messed around with natural language systems myself for many years, working first in Turbo Pascal and later in Ruby. I built a framework that embodies Chomskian grammar rules, into which I can plug different vocabularies so that it spews out sentences that are grammatical but totally nonsensical, like god-awful poetry:
Your son digs and smoothly extracts a gleaming head
like a squid.
The boy stinks like a dumb shuttle.
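For readers who'd like to try the trick themselves, a generator of this kind can be sketched in a few lines of Ruby. This is my own minimal illustration, not the framework described above: a fixed set of phrase-structure rules with a pluggable vocabulary hash, expanded recursively until only words remain. Swap the vocabulary and the register of the nonsense changes while the grammar stays put.

```ruby
# Phrase-structure rules: each non-terminal maps to a list of possible expansions.
RULES = {
  S:  [[:NP, :VP]],
  NP: [[:Det, :N], [:Det, :Adj, :N]],
  VP: [[:V, :NP], [:V, :PP]],
  PP: [[:Prep, :NP]]
}

# A pluggable vocabulary; replace this hash to plug in a different word set.
VOCAB = {
  Det:  %w[the your a],
  N:    %w[son head squid boy shuttle],
  Adj:  %w[gleaming dumb],
  V:    %w[digs extracts stinks],
  Prep: %w[like near]
}

def expand(symbol)
  if VOCAB.key?(symbol)
    VOCAB[symbol].sample                                  # terminal: pick a random word
  else
    RULES[symbol].sample.map { |s| expand(s) }.join(" ")  # non-terminal: recurse
  end
end

puts expand(:S)  # prints a grammatical but meaningless sentence
```

Every sentence it emits is well-formed by construction, which is precisely the point: grammaticality is cheap, meaning is not.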
So to recap: in addition to first recognising which words you just said, and then parsing the grammar of your sentence, the computer comes up against a third brick wall - meaning - and meaning is the hardest problem of them all.
However there has been a significant breakthrough on the meaning front during the last year. I'm talking of course about IBM's huge PR coup in having its Watson supercomputer system win the US TV quiz show "Jeopardy" against human competitors, which I discussed here back in Idealog 200. Watson demonstrated how the availability of cheap multi-core CPUs, when combined with software like Hadoop and UIMA capable of interrogating huge distributed databases in real time, can change the rules of the game when it comes to meaning analysis. In the case of the Jeopardy project, that database consisted of all the back issues of the show plus a vast collection of general knowledge stored in the form of web pages. I've said that I'm sceptical of claims for strong AI, that we can program computers to think the way we think - we don't even understand that ourselves and computers lack our bodies and emotions which are vitally involved in the process - but I'm very impressed by a different approach to AI, namely "case based reasoning" or CBR.
This basically says: don't try to think like a human; instead look at what a lot of actual humans have said and done, in the form of case studies of solved problems, and then try to extract patterns and rules that will let you solve new instances of the problem. Now to apply a CBR-style approach to understanding everyday human speech would involve collecting a vast database of such speech acts, together with some measure of what they were intended to achieve. But surely collecting such a database would be terribly expensive and time-consuming? What you'd need is some sort of pocketable data terminal that zillions of people carry around with them during their daily rounds, and into which they would frequently speak in order to obtain some specific information. Since millions upon millions of these would be needed, somehow you'd have to persuade the studied population to pay for this terminal themselves, but how on earth could *that* happen? Duh.
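The CBR idea can be caricatured in a few lines of Ruby. This is strictly my own toy sketch - nothing to do with how Siri or Watson actually work - but it shows the shape of the approach: keep a database of past utterances labelled with the intent they achieved, then classify a new utterance by retrieving the stored case whose wording it most resembles and reusing that case's solution.

```ruby
# A toy case base: past speech acts paired with the intent each one achieved.
CASES = [
  { text: "what is the weather like today", intent: :weather },
  { text: "remind me to call my mother",    intent: :reminder },
  { text: "find a wine bar near here",      intent: :local_search }
]

# Retrieve the most similar stored case (crude word-overlap similarity)
# and reuse its intent as the answer for the new utterance.
def classify(utterance)
  words = utterance.downcase.split
  best  = CASES.max_by { |c| (c[:text].split & words).size }
  best[:intent]
end

puts classify("will the weather be nice today")  # => weather
```

A real system would use far better similarity measures and millions of cases, but the principle scales: the bigger and more representative the case base, the better the retrieval - which is exactly why the size of the iPhone's user base matters.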
Collecting and analysing huge amounts of speech data is a function of the Siri system, rendered possible by cloud computing and the enormous commercial success of the iPhone, and such analysis is clearly in Apple's own interest because it incrementally improves the accuracy of Siri's recognition and thus gives it a hard-to-match advantage over any rival system. The big question is, could Apple be persuaded or paid to share this goldmine of data with other researchers in order to build a corpus for a more generally available natural language processing service? Perhaps once its current bout of manic patent-trolling subsides a little we might dare to ask...
[Dick Pountain doesn't feel quite so stupid talking to a smartphone as he does to a desktop PC]
My columns for PC Pro magazine, posted here six months in arrears for copyright reasons