
Will voice user interfaces usurp the traditional UI?

Innovation appears to have stalled in mobile and desktop user interfaces, which is why voice user interfaces might be the future of application interaction.

John Allsopp, author of "A Dao of Web Design," laid out the theoretical foundations of responsive web design in 2000. Today, he is laying out the foundations for the next big paradigm shift in application development, one that just might see traditional UIs replaced with voice user interfaces. The next era of apps is not just about making text and graphic interfaces snappier; it's about creating voice user interfaces that tap into the way people have interacted with each other for thousands of years.

In many respects, the UI paradigm has not advanced much since the Apple I computer. "This is actually the same era as today," Allsopp said. "One sat on the desks of a few thousand geeky people, and the other sits in our pockets. Text is the primary way of communication, and that has not changed much since 1976."

Users directly manipulate objects on a screen using mice, keyboards and touch. But the devices themselves are passive. They sit there until we tell them what to do. They are not particularly intelligent, and unlike voice user interfaces, they require our attention. What has changed is that we are taking these devices more and more into our lives and out into the world. "We are basically reaching the end of this paradigm of interaction, both physically and conceptually," Allsopp said. "It is illegal to drive with them. And it is difficult to have simultaneous screen-based and person-to-person interactions."

An active voice user interface

That demand for attention is dangerous. It is illegal in many places to use these devices while driving. In Allsopp's hometown of Sydney, Australia, the government is now embedding traffic lights into the sidewalks to keep distracted pedestrians from walking into traffic. He said, "We have little capacity to do anything else except to drive our passive devices. We have explored all possibilities of computing, with the rest of computing be[ing] a footnote."

Most science fiction user experiences just build on this old paradigm. Movies like The Terminator and RoboCop show characters interacting with text overlaid on the world, a UI that has come to be known as augmented reality. "The challenge with augmented reality is that it heavily relies on the visual cortex, and it still gets in the way of person-to-person interaction," Allsopp said.

A prelude to voice user interfaces

Allsopp argued that audio-oriented devices and voice APIs will play a key role in enabling this new interaction paradigm. While Moore's law on processing power gets a lot of press, Koomey's law is more significant for mobile devices. Jonathan Koomey, who analyzed energy usage for Lawrence Berkeley National Laboratory, found that the number of computations per unit of energy was growing at an even faster rate than Moore's law: a doubling roughly every year and a half, which compounds to about a hundredfold every 10 years.

Early audio interfaces emerging today, like Apple's wireless earbuds, must be tethered to a much more capable mobile phone. But Koomey's law suggests that wireless earbuds could power voice interaction on their own within a few years.

We are basically reaching the end of this paradigm of interaction, both physically and conceptually.
John Allsopp, author of 'A Dao of Web Design'

Allsopp initially had a negative view of the promise of voice user interfaces for two reasons. The first was the reliability of speech to text, particularly if the algorithms had not been trained on a specific voice. The second was that developers had been trying to reuse the existing paradigm of interaction, simply swapping out text and swapping in voice.

In the worst-case scenario, a user experience vision portrayed in Blade Runner shows the hero narrating a convoluted series of voice commands in order to zoom in on an image, crop it and print it out. In the movie, it takes him 2 minutes and 14 seconds to do something that could be done with a pinch-and-zoom interface in about 2 seconds.

Better speech algorithms here today

The challenge of reliable speech recognition is increasingly being solved by deep-learning services such as Wit.ai, Google Speech, IBM Watson, Microsoft Bing Speech and Amazon Alexa. It is now possible to capture sound in every modern browser and then call these APIs to do speech to text with little development overhead. Allsopp believes these algorithms are close to beating humans.
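
As a rough illustration of how little overhead is involved, here is a minimal sketch of in-browser speech to text using the Web Speech API. The constructor is still vendor-prefixed in some browsers, and the transcription itself is handled by the browser vendor's cloud speech service, so treat this as a sketch rather than production code.

// Minimal in-browser speech to text with the Web Speech API.
// Fall back to the webkit-prefixed constructor where needed.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = 'en-US';          // language to transcribe
recognition.interimResults = false;  // only report final results
recognition.maxAlternatives = 1;

recognition.onresult = (event: any) => {
  // The first alternative of the first result is the best transcript.
  const transcript = event.results[0][0].transcript;
  console.log('Heard:', transcript);
};

recognition.onerror = (event: any) => console.error('Recognition error:', event.error);

recognition.start(); // prompts the user for microphone permission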

But what Allsopp believes developers have been getting wrong is how they think about using voice to make things better. He said, "I think this is the foundation of how we move beyond [the] screen-based passive era of computing we have been in since 1976. If all we are [extracting] are a few keywords, all we will be doing is replacing clicking and tapping with our voice. What is happening now are ways of extracting deep and interesting meaning from speech."

This includes algorithms for entity analysis, sentiment analysis, emotion recognition, personality insights, translation, taxonomies and keyword extraction. All of these capabilities are now available via APIs. We can also use cloud APIs to get our devices to talk back to us through the little things we will be wearing in our ears.
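
To make that round trip concrete, here is a rough sketch: a transcript goes out to a text-analysis service and the reply comes back as speech. The /analyze endpoint and its response shape are hypothetical stand-ins for whichever cloud NLP service is actually used; the SpeechSynthesis call is the standard browser API.

// Send a transcript for analysis, then speak the reply aloud.
async function analyzeAndRespond(transcript: string): Promise<void> {
  const response = await fetch('https://example.com/analyze', { // hypothetical endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: transcript }),
  });
  const result = await response.json(); // e.g. { sentiment: 'negative', score: -0.7 }

  // speechSynthesis reads the reply out loud; on an earbud-style device,
  // this spoken response is the whole user interface.
  const utterance = new SpeechSynthesisUtterance(
    `That sounded ${result.sentiment}. Do you want to rephrase it?`);
  window.speechSynthesis.speak(utterance);
}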

The question is: What is this paradigm going to look like? Allsopp said, "I don't know in the same way that Jobs, Sinclair and Wozniak could not imagine the history of computing to come. It is not just about combining services, APIs and experiences. It is a whole thing."

Lots of things

These things could include mental health analytics of users. These devices could listen to our tone of voice and the things we are saying to improve our interactions with coworkers, friends and children. This could be about boosting our EQ, or emotional intelligence, rather than just our IQ.

This is not science fiction. Call centers are already using speech to text and sentiment analysis to give agents tools that help them better handle the people on the phone. Allsopp said, "We already have the technology in our browsers to build something like the Babel fish out of The Hitchhiker's Guide to the Galaxy. This technology is available now in browsers, and developers could piece together prototypes in an afternoon."
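
In that spirit, here is an afternoon-prototype sort of sketch of a Babel fish: listen in one language, translate, and speak the result in another. The recognition and synthesis calls are the same Web Speech APIs as above, while the translateText() helper and its endpoint are hypothetical placeholders for whatever cloud translation API is available.

// Placeholder wrapper around a cloud translation API (hypothetical endpoint).
async function translateText(text: string, target: string): Promise<string> {
  const res = await fetch(`https://example.com/translate?to=${target}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  return (await res.json()).translation;
}

// Listen in English, speak the translation back in French.
const babelFish = new ((window as any).SpeechRecognition ||
                       (window as any).webkitSpeechRecognition)();
babelFish.lang = 'en-US';
babelFish.onresult = async (event: any) => {
  const heard = event.results[0][0].transcript;
  const translated = await translateText(heard, 'fr');
  const utterance = new SpeechSynthesisUtterance(translated);
  utterance.lang = 'fr-FR'; // pick a French voice for playback
  window.speechSynthesis.speak(utterance);
};
babelFish.start();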

Next Steps

How to build voice user interfaces using the Alexa software development kit

An introduction to voice-based systems

Enterprise use cases for voice user interface applications
