Discussions

News: Combining Google Language API and Lucene

  1. Lucene is one of the most widely used IR frameworks around, but in order to work properly, its documents must be indexed and analyzed in the proper manner. Choosing the right Analyzer implementation can be the difference between a good and a bad index. In this post http://blog.furiousbob.com/2009/07/06/automatic-language-detection/ I present a simple fragment of code that uses Google's language API. One could use this API to instantiate the correct Analyzer for one's Lucene application.
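
    A minimal sketch of the idea in Java (detectLanguage here is a hypothetical wrapper around Google's language API, not the API itself; the language-specific analyzers come from Lucene's contrib analyzers package, and constructor signatures vary by Lucene version):

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.analysis.de.GermanAnalyzer;
      import org.apache.lucene.analysis.fr.FrenchAnalyzer;

      public class AnalyzerChooser {

          // Returns an Analyzer suited to the detected language of the text.
          public static Analyzer forText(String text) {
              String lang = detectLanguage(text); // e.g. "en", "fr", "de"
              if ("fr".equals(lang)) {
                  return new FrenchAnalyzer();
              } else if ("de".equals(lang)) {
                  return new GermanAnalyzer();
              }
              return new StandardAnalyzer(); // sensible default for English
          }

          // Hypothetical helper: call Google's language API and return the
          // ISO code of the detected language (HTTP call omitted for brevity).
          private static String detectLanguage(String text) {
              return "en";
          }
      }
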
  2. Good stuff, but what if you are offline? I developed a similar feature that used a neural network. It computed entries built from n-gram fragments of any text in any language. Maybe Google works like this?
  3. Yeah, being online is a must. I was thinking of using some sort of classifier for that, a naive Bayes for instance. I may still implement it one day. A good thing about Google, though, is the large number of languages supported. I don't think I could find that many documents in different languages to train my classifier.
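
    A character-trigram naive Bayes classifier along those lines could be sketched like this (illustrative only: you still need training text per language, and the smoothing is the simplest possible):

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Minimal naive Bayes language identifier over character trigrams.
      public class NaiveBayesLangId {

          private final Map<String, Map<String, Integer>> counts =
                  new HashMap<String, Map<String, Integer>>();
          private final Map<String, Integer> totals = new HashMap<String, Integer>();

          // Count the trigrams of a training text under the given language.
          public void train(String language, String text) {
              Map<String, Integer> model = counts.get(language);
              if (model == null) {
                  model = new HashMap<String, Integer>();
                  counts.put(language, model);
                  totals.put(language, 0);
              }
              for (String gram : trigrams(text)) {
                  Integer c = model.get(gram);
                  model.put(gram, c == null ? 1 : c + 1);
                  totals.put(language, totals.get(language) + 1);
              }
          }

          // Pick the language with the highest log-likelihood.
          public String classify(String text) {
              String best = null;
              double bestScore = Double.NEGATIVE_INFINITY;
              for (String lang : counts.keySet()) {
                  Map<String, Integer> model = counts.get(lang);
                  double denom = totals.get(lang) + model.size() + 1.0;
                  double score = 0.0;
                  for (String gram : trigrams(text)) {
                      Integer c = model.get(gram);
                      // Laplace smoothing so unseen trigrams don't zero the score.
                      score += Math.log(((c == null ? 0 : c) + 1.0) / denom);
                  }
                  if (score > bestScore) {
                      bestScore = score;
                      best = lang;
                  }
              }
              return best;
          }

          private static List<String> trigrams(String text) {
              List<String> grams = new ArrayList<String>();
              String s = text.toLowerCase();
              for (int i = 0; i + 3 <= s.length(); i++) {
                  grams.add(s.substring(i, i + 3));
              }
              return grams;
          }
      }

    A page or two of training text per language is usually enough to get useful results from classify().
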
  4. Cool stuff indeed. I will definitely follow the progress on this!
  5. Libraries for offline use

    Several libraries for language detection are available that do not require online access, e.g. this one: http://www.jroller.com/melix/entry/nlp_in_java_a_language
  6. Yes, building an n-gram model on a corpus and then comparing it to the n-gram frequency distribution of the sentence to be classified works extremely well for language identification, even on very short sentences. I am surprised so few libraries/APIs/frameworks seem to take advantage of this algorithm in the context of i18n.
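
    A rough sketch of that approach (rank-ordered n-gram frequency profiles compared with the "out-of-place" measure, in the spirit of Cavnar and Trenkle's 1994 paper; building the per-language profiles from corpora is left out):

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.Comparator;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class NGramProfile {

          // Build a profile: the topK n-grams of the text, most frequent first.
          static List<String> profile(String text, int n, int topK) {
              final Map<String, Integer> freq = new HashMap<String, Integer>();
              String s = text.toLowerCase();
              for (int i = 0; i + n <= s.length(); i++) {
                  String gram = s.substring(i, i + n);
                  Integer c = freq.get(gram);
                  freq.put(gram, c == null ? 1 : c + 1);
              }
              List<String> grams = new ArrayList<String>(freq.keySet());
              Collections.sort(grams, new Comparator<String>() {
                  public int compare(String a, String b) {
                      return freq.get(b) - freq.get(a); // descending frequency
                  }
              });
              return grams.subList(0, Math.min(topK, grams.size()));
          }

          // "Out-of-place" distance: how far each document n-gram's rank is
          // from its rank in the language profile; n-grams missing from the
          // language profile get the maximum penalty. The language whose
          // profile has the smallest distance to the document wins.
          static int distance(List<String> doc, List<String> lang) {
              int d = 0;
              for (int i = 0; i < doc.size(); i++) {
                  int j = lang.indexOf(doc.get(i));
                  d += (j < 0) ? lang.size() : Math.abs(i - j);
              }
              return d;
          }
      }

    Because the top-ranked n-grams of a language are very stable, a few hundred trigrams per profile is already enough, which is why it works even on short sentences.
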
  7. Great stuff. One good thing about the Google API is the large number of languages supported :)