Discovering data by structuring unstructured text

TheServerSide.com

On the surface, scanning a newspaper column or magazine article and extracting key pieces of information doesn't appear to be a challenging computer science problem. After all, it should just be a matter of picking out important words and key phrases and mapping them into a database. The frustrating truth, though, is that dealing with unstructured text, be it a Facebook status update or a telephone intercept that has been transcribed to text, is a massive challenge the industry has been struggling with for decades.

The primary challenge most of our customers run into is with the ambiguity of the language.

Bryan Bell,
executive vice president, Expert System

An employment application form is well structured. Even if key fields like a date of birth or ZIP code have elements transposed, such errors can be handled quite effectively with various data cleansing tools. But a sentence such as "There's a fox on the plane" isn't quite as easy to deal with. Does it mean there's an attractive woman -- a fox -- sitting in seat 4B of a transatlantic flight, or does it mean there's a wild animal sleeping on a piece of woodworking equipment -- a carpenter's plane? "The primary challenge most of our customers run into is with the ambiguity of the language," said Bryan Bell, executive vice president with Expert System, speaking about the difficulty everyone from intelligence agencies to pharmaceutical companies has in dealing with Brobdingnagian amounts of incoming information. And it is because of this ubiquitous ambiguity of spoken and written language that developing tools to effectively convert unstructured text to structured data has largely eluded the industry.
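To see how slippery that disambiguation is in practice, here is a minimal sketch using NLTK's implementation of the classic Lesk algorithm, an off-the-shelf choice picked purely for illustration; the article does not tie Expert System's technology to any particular algorithm.

```python
# A minimal word-sense disambiguation sketch with NLTK's Lesk algorithm.
# Illustrative only -- this is not Expert System's proprietary approach.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet sense inventory used by Lesk

sentence = "there's a fox on the plane"
context = sentence.split()  # crude whitespace tokenization is enough here

for word in ("fox", "plane"):
    # Lesk picks the WordNet sense whose gloss best overlaps the context.
    sense = lesk(context, word)
    if sense is not None:
        print(f"{word}: {sense.name()} -- {sense.definition()}")
```

On a context this short, Lesk frequently picks the wrong WordNet sense, which is exactly the ambiguity problem Bell describes.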

The other big problem in dealing with unstructured text is one of balance. Simply pulling keywords out of text files is fast and easy, but it generates an enormous number of vague results, and when those results are consumed by standard analytics tools, an unworkable number of false positives and fruitless leads emerges. On the other hand, various filtering rules can be applied, but the number of restrictions needed to capture all of the linguistic intricacies of the English language can bring even the most powerful computing environments to a halt. Furthermore, the more restrictive the filters, the more likely false negatives become: good, useful data gets tossed aside because somewhere along the line a red flag was raised.
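A toy sketch makes the tradeoff concrete. All of the documents, relevance labels and rules below are invented for the demonstration:

```python
# Toy illustration of the keyword-versus-rules tradeoff: a loose keyword
# match floods the analyst with false positives, while an over-restrictive
# rule starts throwing away genuine hits (false negatives).
documents = [
    ("a fox escaped its crate in the cargo hold of the plane", True),
    ("what a fox -- she boarded the plane and took seat 4B", False),
    ("a stray fox dashed across the runway toward the plane", True),
    ("the carpenter set his plane down next to the fox carving", False),
]

def keyword_match(text):
    # Loose rule: flag any text that mentions both keywords.
    return "fox" in text and "plane" in text

def restrictive_match(text):
    # Tight rule: additionally demand an explicit aviation cue word.
    return keyword_match(text) and "cargo" in text

for name, rule in (("keyword", keyword_match), ("restrictive", restrictive_match)):
    tp = sum(1 for text, relevant in documents if rule(text) and relevant)
    fp = sum(1 for text, relevant in documents if rule(text) and not relevant)
    fn = sum(1 for text, relevant in documents if not rule(text) and relevant)
    print(f"{name}: true positives={tp}, false positives={fp}, false negatives={fn}")
```

The loose rule catches every genuine hit but buries the analyst in junk; tightening it removes the junk at the cost of discarding a real lead.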

Organizations are dealing with a fire hose of incoming data these days, but that data is useless if it can't be properly processed and incorporated into the various workflows that perform analytic processing. This is why there is a massive need to structure data and then integrate it into existing workflows. And what's the end result? "Organizations discover information they didn't know was in their data, because they are able to tag it, structure it and organize it in a structured manner," said Bell. "No matter what industry you're in, if you have this ambiguity problem and large amounts of data, you're part of the target market for semantic intelligence."
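As a rough sketch of what "tag it, structure it and organize it" can look like in code, here is an entity-tagging pass with spaCy, an off-the-shelf NLP library chosen purely for illustration; the article names no specific tooling.

```python
# Pull named entities out of free text and emit structured records a
# downstream analytics workflow could ingest. Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import json
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Expert System's Bryan Bell spoke on 30 May 2014 about helping "
        "intelligence agencies and pharmaceutical companies structure text.")

doc = nlp(text)
records = [
    {"text": ent.text, "type": ent.label_,
     "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
]
print(json.dumps(records, indent=2))  # structured output from unstructured input
```

Each record carries an entity type and character offsets, so a downstream workflow can load it into a database rather than rescanning raw text.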

As computers become smarter and faster, there is something redeeming about the fact that they're still not smart enough to understand the nuances of speech that carbon-based life forms take for granted. But as the tools and technologies being applied in this space grow ever more powerful, the gap between human speech and computer understanding is narrowing to the point where unstructured text will be easily structured, without false positives or false negatives even being a concern.

30 May 2014