Language Identification System: How to recognize other languages than English



  1. Language identification systems usually fail on short sentences, whether they are based on n-grams or on dictionaries. We have put together a new technique based on stemming (Porter) plus stopwords that does the job and runs pretty fast.

    In this article we share the results of this analysis and the timings we get compared with the Google APIs.

    What's your opinion? Can stemming + stopwords complement n-grams to solve language identification for short phrases, or should we forget about Twitter and just not analyze those damn short messages?

    Kind Regards.
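
For readers who want something concrete, here is a minimal sketch of the stopword half of the idea only. It is not the implementation discussed in the article: the stopword lists are tiny illustrative assumptions, the names `STOPWORDS`, `score_languages` and `guess_language` are made up for this example, and the Porter stemming step is left out entirely.

```python
import re

# Tiny illustrative stopword lists (assumption: a real system would use
# much larger lists and many more languages).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "you", "for"},
    "es": {"el", "la", "los", "y", "es", "de", "que", "en", "un", "por"},
    "fr": {"le", "la", "les", "et", "est", "de", "que", "un", "une", "pour"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "mit", "ich"},
}


def score_languages(text):
    """Count, per language, how many tokens of `text` are known stopwords."""
    # Split into lowercase alphabetic tokens (Unicode letters only).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }


def guess_language(text):
    """Return the language with the most stopword hits, or None if no hits."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for sentence in ("the cat is in the garden", "el perro es de la casa"):
        print(sentence, "->", guess_language(sentence))
```

Even a short sentence usually contains a couple of function words, which is presumably why stopwords help where n-gram statistics are too sparse. Ties between languages are resolved arbitrarily here (first language in the dictionary wins), which a real system would need to handle more carefully.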

  2. Snowball might be good for some languages, and stopwords for a different (slightly overlapping) set of languages, but for languages in general you need to find another solution. For example, variable-sized n-grams combined with a text-distance algorithm can give pretty good language-independent results; at least that is what I see in my dictionary program.
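
As a rough illustration of the n-gram plus distance idea, the sketch below builds character n-gram profiles for n = 1 to 3 and picks the reference language whose profile is closest. Everything in it is an assumption made for the example: cosine distance is swapped in as one possible distance measure, the reference samples are single sentences rather than real corpora, and the names `ngram_profile`, `cosine_distance` and `closest_language` are hypothetical. It is not the dictionary program mentioned above.

```python
import math
from collections import Counter


def ngram_profile(text, n_min=1, n_max=3):
    """Count character n-grams of every size from n_min to n_max."""
    # Normalize whitespace and pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # An empty profile is treated as maximally distant.
    return 1.0 - dot / (norm_a * norm_b)


# One-sentence reference samples (assumption: real profiles would be
# trained on large corpora per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what people think",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es lo que piensa la gente",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist was die leute denken",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}


def closest_language(text):
    """Return the reference language whose profile is nearest to the text."""
    profile = ngram_profile(text)
    return min(PROFILES, key=lambda lang: cosine_distance(profile, PROFILES[lang]))
```

With reference samples this small the result on unseen short text is unreliable, so treat the code as a shape for the approach rather than a working detector.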

  3. Hi Istvan,

    As far as I know, Tika detects languages based on n-grams. In our tests this option was fantastic on the configuration side and when analyzing long texts, but it loses precision on short phrases.

    Our strategy is actually to mix algorithms and choose the best option, and your distance idea could fit in here. Are you using Levenshtein's distance?

    Kind Regards.




  4. "Are you using Levenshtein's distance?"


    I do. Sometimes, depending on ngram analysis :)

  5. Analyzing short texts will always be a major headache because they contain so little information. In the Twitter case, it may be a good idea to concatenate multiple posts from the same user: this way you get more words, and one user probably posts in only one language. I have written a web service myself to identify the language of sentences. It can detect 100+ languages and has an easy-to-use API. You can test it at