Aligorith's Lair: storytelling

Sunday, January 8, 2017

Voice Controlled AI Devices - A Reaction Post

In response to the article about voice-controlled boxes being activated by a news item on how a kid managed to buy a dollhouse + cookies off Amazon via voice control.

Interpreting sound has never been an easy thing. Not for humans, and definitely not for computers! If you actually think about it, it's not that hard to imagine how hard it is for a computer to understand speech and sounds. For example:
* How many times have you had trouble understanding someone's accent? Or had a misunderstanding because you misheard someone's muffled speech over a noisy/faint/crackling/unreliable phone line? Well, guess what: for a computer doing voice recognition, the only input it's got is the sound coming in from the microphone... which of course is mixed in with everything else going on sonically in that environment (e.g. TVs, smartphones, gaming consoles, music players, rangehoods, kitchen equipment, aircon, running taps, open windows/traffic-noise/neighbours, bickering flatmates, etc.). And that's not to mention that the users may be out of range of the microphone, or that the microphones may be cheap trash bought at bargain-basement prices, or may have been wired backwards...

* How many times have you been watching a film or TV show, and found yourself lurching for the fire escape as a siren sounded on screen? Or reached for your phone, only to realise that it wasn't your phone ringing, but that of the lady at the next table? Or perhaps you've responded to someone calling your name, only to find that a stranger had been calling another stranger, and not you (the now slightly embarrassed sucker trying to pretend that you didn't just answer to your name). Clearly, even us humans get it wrong quite often, but at least we often have the benefit of *context*, the ability to use our other senses to disambiguate the situation, and a few other "on-the-fly" techniques. (This probably goes some way towards explaining why people like me really don't like answering phone calls or having to call people on the phone...). Anyways, if it's hard for us humans to get this stuff right, expect the computers to have an even harder time disambiguating all of this!

Inspired by all this, I wondered what a "day in the life" of one of these voice recognition boxes would be like when deployed in a domestic environment that's kind of far from the "idealised model-human" fantasy that designers often find themselves falling back on... The answer was that it would feel like being a lost and isolated operative thrust into a war zone - "hostile enemy territory"...