If you’ve given this a bit of thought, you’ve probably figured out that understanding grammar, with its nearly infinite possible combinations of words, expressions, numbers, and dialects, requires a significant amount of computing power and a good deal of creativity.

There’s business value, all right.

The ability to coerce our highly variable, fluent speech into a format that computer logic can understand and use is a complex task with great potential reward. Voice control has long been an aspiration in the field of human-computer interaction. Our friends at Google and Apple seem to have come a long way with products like voice search and Siri, but these features are mostly unavailable for our own use. These highly valuable services are also extremely proprietary and closely guarded; the value is so clear that companies fight tooth and nail to kill off any potential competition. Google, for instance, keeps its speech recognition service behind miles of servers and proprietary code, opaque to the outside world.

This initial step, the literal translation of spoken words into written text, is also only a small fraction of the problem. Not only must speech be transcribed into text, but that text must then be parsed and normalized into a grammar that computers can understand and represent in rigid, inflexible data structures:

Imagine a cardboard packing box that contains a single word or phrase. The box is opaque – you don’t know which word it contains, but there is a white label on the box that displays an identification number, a category, and the identification numbers of several other smaller boxes with which this box must be packaged for shipping. Each smaller box contains other unknown words and phrases, and must be put in the correct order.

Now consider that you are given a set of these boxes and told, “Without opening any boxes, follow the instructions contained within.” You don’t know what’s actually in the boxes; you can’t read them. All you can do is put them in the right order, and ask someone else what they think each box’s contents are (without looking).
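Before tying this back to real systems, here is a minimal sketch of what one of these “boxes” might look like as a data structure. Everything here (the Box class, its fields, the askOracle lookup) is illustrative only, not any particular library’s API:

```java
import java.util.List;
import java.util.Map;

// An opaque "box": the word inside is never visible to us; all we can
// read is the white label on the outside.
class Box {
    final int id;                  // identification number on the label
    final String category;         // grammatical category, e.g. "NOUN" or "VERB"
    final List<Integer> packWith;  // ids of the boxes this one ships with, in order

    Box(int id, String category, List<Integer> packWith) {
        this.id = id;
        this.category = category;
        this.packWith = packWith;
    }

    // "Ask someone else what they think the contents are": the oracle here
    // is just a lookup table, standing in for a trained classifier.
    static String askOracle(Map<Integer, String> oracle, Box box) {
        return oracle.getOrDefault(box.id, "unknown");
    }
}
```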

This is effectively what Google’s voice control, Apple’s Siri, and IBM’s Watson are doing – building up a set of boxes, each one containing a word, then consulting a reference (some kind of neural network or other AI system) to determine what to do with each box. Ultimately, companies like these are using your words, phrases, and conversations to build marketing data, analyze trends, and figure out how to stay ahead of the curve in their respective business sectors. For now, a person needs to assign meaning to each of these boxes, but over time, computers should gain a better understanding of what certain boxes mean, and build relationships between them.

So as you can see, there’s a lot more involved than simple transcription; there’s a whole world of logic required to put meaning behind these words.
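For one small corner of this problem, extracting date and time expressions from free-form English, PrettyTime :: NLP does exactly this kind of normalization. A minimal sketch of its use (assuming the prettytime-nlp dependency is on the classpath; the sentence is just an illustration):

```java
import java.util.Date;
import java.util.List;

import org.ocpsoft.prettytime.nlp.PrettyTimeParser;

public class NlpExample {
    public static void main(String[] args) {
        // Pull the date/time "boxes" out of an ordinary English sentence.
        List<Date> dates = new PrettyTimeParser()
                .parse("Let's grab coffee in three days at around 2pm");

        // Prints the concrete Date(s) the parser recovered from the text.
        System.out.println(dates);
    }
}
```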

Read the full article, and learn more about PrettyTime :: NLP at ocpsoft.org.