Combining Google Language API and Lucene (6 messages)
- Posted by: Vinicius Carvalho
- Posted on: July 08 2009 10:42 EDT
Lucene is one of the most widely used IR frameworks around. But for it to work properly, its documents must be indexed and analyzed correctly. Choosing the right Analyzer implementation can be the difference between a good index and a bad one. In this post, http://blog.furiousbob.com/2009/07/06/automatic-language-detection/, I present a simple code fragment that uses Google's language API. One could use this API to instantiate the correct Analyzer for a Lucene application.
Threaded Messages (6)
- Re: Combining Google Language API and Lucene by Vachon Ulrich on July 08 2009 16:40 EDT
- Re: Combining Google Language API and Lucene by Vinicius Carvalho on July 08 2009 16:53 EDT
- Re: Combining Google Language API and Lucene by Amin Mohammed-Coleman on July 09 2009 08:03 EDT
- libraries for offline use by Ulf Dittmer on July 10 2009 03:12 EDT
- Re: libraries for offline use by Faizal Abdoelrahman on July 10 2009 04:20 EDT
- Re: libraries for offline use by Vinicius Carvalho on July 10 2009 13:17 EDT
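The idea in the original post, picking an Analyzer from a detected language code, could look roughly like the sketch below. It is not the code from the linked blog post. `AnalyzerSelector` and `analyzerClassFor` are hypothetical names, the language code is assumed to come from whatever detector you use, and the Lucene class names are the ones I recall from the analyzers modules, so check them against your Lucene version. Returning class names as strings keeps the sketch free of a Lucene dependency; a real application would instantiate the Analyzer instead.

```java
import java.util.Locale;
import java.util.Map;

public class AnalyzerSelector {
    // Fallback used when the detected language has no dedicated analyzer.
    static final String DEFAULT_ANALYZER =
            "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // ISO 639-1 code -> analyzer class name (illustrative mapping, an assumption).
    static final Map<String, String> ANALYZERS = Map.of(
            "pt", "org.apache.lucene.analysis.br.BrazilianAnalyzer",
            "fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer",
            "de", "org.apache.lucene.analysis.de.GermanAnalyzer",
            "en", DEFAULT_ANALYZER);

    // Accepts codes such as "fr" or "pt-BR"; only the primary subtag is used.
    public static String analyzerClassFor(String languageCode) {
        if (languageCode == null) {
            return DEFAULT_ANALYZER;
        }
        String key = languageCode.trim().toLowerCase(Locale.ROOT);
        int dash = key.indexOf('-');
        if (dash > 0) {
            key = key.substring(0, dash);
        }
        return ANALYZERS.getOrDefault(key, DEFAULT_ANALYZER);
    }

    public static void main(String[] args) {
        System.out.println(analyzerClassFor("fr"));
        System.out.println(analyzerClassFor("pt-BR"));
        System.out.println(analyzerClassFor("xx")); // unknown code falls back to the default
    }
}
```

Falling back to a general-purpose analyzer for unknown or undetected languages seems safer than failing the indexing run.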
Re: Combining Google Language API and Lucene
- Posted by: Vachon Ulrich
- Posted on: July 08 2009 16:40 EDT
- in response to Vinicius Carvalho
Good stuff, but what if you are offline? I developed a similar feature that used a neural network. It processed inputs built from n-gram fragments of text in any language. Maybe Google works like this?
Re: Combining Google Language API and Lucene
- Posted by: Vinicius Carvalho
- Posted on: July 08 2009 16:53 EDT
- in response to Vachon Ulrich
Yeah, being online is a must. I was thinking of using some sort of classifier for that, naive Bayes for instance. I may still implement it one day. A good thing about Google, though, is the large number of languages supported. I don't think I could find that many documents in different languages to train my classifier.
Re: Combining Google Language API and Lucene
- Posted by: Amin Mohammed-Coleman
- Posted on: July 09 2009 08:03 EDT
- in response to Vinicius Carvalho
Cool stuff indeed. I will definitely follow the progress on this!
libraries for offline use
- Posted by: Ulf Dittmer
- Posted on: July 10 2009 03:12 EDT
- in response to Vinicius Carvalho
Several libraries for language detection are available that do not require online access, e.g. this one: http://www.jroller.com/melix/entry/nlp_in_java_a_language
Re: libraries for offline use
- Posted by: Faizal Abdoelrahman
- Posted on: July 10 2009 04:20 EDT
- in response to Ulf Dittmer
Yes, building an n-gram model on a corpus and then comparing it to the n-gram frequency distribution of the sentence to be classified works extremely well for language identification, even on very short sentences. I am surprised that so little software (APIs, frameworks) seems to take advantage of this algorithm in the context of i18n.
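A minimal sketch of that n-gram comparison, for illustration only. It assumes character trigrams and cosine similarity between frequency vectors; the post does not say which n or which distance measure is meant, and the class and method names here are hypothetical. Real use would need a far larger training corpus per language than the one-line samples in `main`.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NGramLanguageGuesser {
    private static final int N = 3; // character trigrams (an assumption)

    // language name -> unit-length trigram frequency vector
    private final Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();

    // Counts character trigrams and scales the counts to a unit-length vector.
    static Map<String, Double> profile(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1.0, Double::sum);
        }
        double sumOfSquares = 0;
        for (double c : counts.values()) {
            sumOfSquares += c * c;
        }
        final double norm = Math.sqrt(sumOfSquares);
        if (norm > 0) {
            counts.replaceAll((gram, count) -> count / norm);
        }
        return counts;
    }

    public void train(String language, String corpus) {
        profiles.put(language, profile(corpus));
    }

    // Returns the trained language whose profile has the highest cosine
    // similarity to the text, or null if nothing has been trained.
    public String guess(String text) {
        Map<String, Double> query = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> lang : profiles.entrySet()) {
            double score = 0;
            for (Map.Entry<String, Double> q : query.entrySet()) {
                score += q.getValue() * lang.getValue().getOrDefault(q.getKey(), 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and then the dog chases the fox through the garden");
        guesser.train("pt", "a raposa marrom pula sobre o cachorro e depois o cachorro persegue a raposa pelo jardim");
        System.out.println(guesser.guess("the dog is in the garden"));
        System.out.println(guesser.guess("o cachorro pula sobre a raposa"));
    }
}
```

Padding the text with spaces lets word boundaries contribute trigrams too, which tends to help on short inputs.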
Re: libraries for offline use
- Posted by: Vinicius Carvalho
- Posted on: July 10 2009 13:17 EDT
- in response to Ulf Dittmer
Great stuff. One good thing about the Google API is the large number of languages supported :)