News: Combining Google Language API and Lucene
Lucene is one of the most widely used IR frameworks around, but in order for it to work properly, its documents must be indexed and analyzed in the proper manner. Choosing the right Analyzer implementation can be the difference between a good and a bad index. In this post http://blog.furiousbob.com/2009/07/06/automatic-language-detection/ I present a simple fragment of code that uses Google's language API. One could use this API to instantiate the correct Analyzer for one's Lucene application.
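The selection step the post describes could be sketched like this. This is not the post's actual code: the language codes and the mapping to Lucene's contrib analyzers are illustrative, and the detection call to Google's API is assumed to have already returned an ISO 639-1 code.

```java
public class AnalyzerChooser {

    // Map a detected ISO 639-1 language code to the fully-qualified name of
    // a Lucene (contrib) analyzer; fall back to StandardAnalyzer when the
    // language is unknown or has no dedicated analyzer.
    public static String analyzerFor(String languageCode) {
        if ("fr".equals(languageCode)) {
            return "org.apache.lucene.analysis.fr.FrenchAnalyzer";
        } else if ("de".equals(languageCode)) {
            return "org.apache.lucene.analysis.de.GermanAnalyzer";
        } else if ("pt".equals(languageCode)) {
            return "org.apache.lucene.analysis.br.BrazilianAnalyzer";
        }
        return "org.apache.lucene.analysis.standard.StandardAnalyzer";
    }

    public static void main(String[] args) {
        System.out.println(analyzerFor("de")); // German documents
        System.out.println(analyzerFor("xx")); // unknown code, falls back
    }
}
```

Returning the class name keeps the sketch free of a Lucene dependency; in a real application one would construct the Analyzer instance directly (or via `Class.forName`) and pass it to the `IndexWriter`.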
- Posted by: Vinicius Carvalho
- Posted on: July 08 2009 10:42 EDT
- Re: Combining Google Language API and Lucene by Vachon Ulrich on July 08 2009 16:40 EDT
- Re: Combining Google Language API and Lucene by Amin Mohammed-Coleman on July 09 2009 08:03 EDT
- libraries for offline use by Ulf Dittmer on July 10 2009 03:12 EDT
Good stuff, but what if you are offline? I had developed a similar feature which used a neural network. It computed entries built from n-gram fragments of any text in any language. Maybe Google works like this?
Yeah, being online is a must. I was thinking of using some sort of classifier for that, naive Bayes for instance. I may still implement it one day. A good thing about Google, though, is the large number of languages supported. I don't think I could find that many documents in different languages to train my classifier.
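The naive Bayes idea could look something like this toy sketch: character bigrams as features, add-one smoothing, and a uniform prior. The training strings and class names are made up; as noted above, a real classifier would need far more training text per language.

```java
import java.util.*;

public class NaiveBayesLangId {
    // Per-language bigram counts, per-language totals, and the global vocabulary.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // Split a text into overlapping character bigrams.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 2 <= t.length(); i++) {
            out.add(t.substring(i, i + 2));
        }
        return out;
    }

    // Count the bigrams of a training text under the given language label.
    public void train(String lang, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(lang, k -> new HashMap<>());
        for (String g : bigrams(text)) {
            c.merge(g, 1, Integer::sum);
            totals.merge(lang, 1, Integer::sum);
            vocab.add(g);
        }
    }

    // Pick the language with the highest smoothed log-likelihood.
    public String classify(String text) {
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (String lang : counts.keySet()) {
            double logP = 0.0;
            int total = totals.get(lang);
            for (String g : bigrams(text)) {
                int c = counts.get(lang).getOrDefault(g, 0);
                logP += Math.log((c + 1.0) / (total + vocab.size()));
            }
            if (logP > bestLog) { bestLog = logP; best = lang; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesLangId nb = new NaiveBayesLangId();
        nb.train("en", "the quick brown fox jumps over the lazy dog");
        nb.train("pt", "o rapido cachorro marrom pula sobre o gato preguicoso");
        System.out.println(nb.classify("the lazy fox"));  // en
        System.out.println(nb.classify("o gato marrom")); // pt
    }
}
```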
Cool stuff indeed. I will definitely follow the progress on this!
Several libraries for language detection are available that do not require online access, e.g. this one: http://www.jroller.com/melix/entry/nlp_in_java_a_language
Yes, building an n-gram model on a corpus and then comparing it to the n-gram frequency distribution of the sentence to be classified works extremely well for language identification, even on very short sentences. I am surprised so few libraries/APIs/frameworks seem to take advantage of this algorithm in the context of i18n.
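The comparison described above could be sketched as follows. This is a toy version: the two "corpora" are single sentences, and the similarity measure is a crude count-overlap stand-in for the rank-based distance normally used between n-gram profiles.

```java
import java.util.*;

public class NGramLangId {
    // Build a character trigram frequency profile for a text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = " " + text.toLowerCase() + " ";
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Similarity = sum of shared trigram counts between two profiles.
    static int similarity(Map<String, Integer> a, Map<String, Integer> b) {
        int score = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            score += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return score;
    }

    // Return the language whose model best matches the sentence's profile.
    static String classify(String sentence, Map<String, Map<String, Integer>> models) {
        Map<String, Integer> p = profile(sentence);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> m : models.entrySet()) {
            int s = similarity(p, m.getValue());
            if (s > bestScore) { bestScore = s; best = m.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy models -- real profiles are trained on much larger corpora.
        Map<String, Map<String, Integer>> models = new HashMap<>();
        models.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        models.put("fr", profile("le renard brun saute par dessus le chien paresseux et le chat"));
        System.out.println(classify("the dog and the fox", models));  // en
        System.out.println(classify("le chien et le renard", models)); // fr
    }
}
```

Even with these tiny models the short test sentences are classified correctly, which matches the observation above that the technique works well on very short input.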
Great stuff. One good thing about the Google API is the large number of languages supported :)