
Combining Google Language API and Lucene

Posted by: Vinicius Carvalho on July 08, 2009
Lucene is one of the most widely used IR frameworks around. But for it to work properly, its documents must be indexed and analyzed correctly. Choosing the right Analyzer implementation can be the difference between a good index and a bad one.
In this post http://blog.furiousbob.com/2009/07/06/automatic-language-detection/ I present a simple code fragment that uses Google's language API. One could use this API to instantiate the correct Analyzer for a Lucene application.
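
As a rough illustration of that last step, here is a minimal sketch of choosing an analyzer from a detected language code. It is not the code from the linked post. The class `AnalyzerSelector` and its methods are hypothetical names, and the type parameter `A` stands in for Lucene's `Analyzer` so that the example compiles without Lucene on the classpath. It also assumes the detection service returns codes such as "pt" or "pt-BR" and uses only the primary subtag.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: picks an analyzer factory from a detected language code. */
final class AnalyzerSelector<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    private final Supplier<A> fallback;

    /** The fallback is used when the language is unknown or was not detected. */
    AnalyzerSelector(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    /** Registers a factory for a language code; returns this for chaining. */
    AnalyzerSelector<A> register(String languageCode, Supplier<A> factory) {
        byLanguage.put(normalize(languageCode), factory);
        return this;
    }

    /** Returns a new analyzer for the detected code, or the fallback analyzer. */
    A forLanguage(String detectedCode) {
        if (detectedCode == null) {
            return fallback.get();
        }
        return byLanguage.getOrDefault(normalize(detectedCode), fallback).get();
    }

    // Assumption: only the primary subtag matters, so "pt-BR" and "PT" both become "pt".
    private static String normalize(String code) {
        String lower = code.trim().toLowerCase(Locale.ROOT);
        int dash = lower.indexOf('-');
        return dash < 0 ? lower : lower.substring(0, dash);
    }
}
```

In a Lucene application, `A` would be the `Analyzer` type and each supplier would construct a language-specific analyzer, with a general-purpose one as the fallback. Analyzer constructors differ between Lucene versions, so they are left out here. Reducing codes to the primary subtag also merges variants such as "zh-CN" and "zh-TW", which may not be what every index needs.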

Threaded replies

·  Combining Google Language API and Lucene by Vinicius Carvalho on Wed Jul 08 10:42:06 EDT 2009
  ·  Re: Combining Google Language API and Lucene by Vachon Ulrich on Wed Jul 08 16:40:49 EDT 2009
    ·  Re: Combining Google Language API and Lucene by Vinicius Carvalho on Wed Jul 08 16:53:48 EDT 2009
  ·  Re: Combining Google Language API and Lucene by Amin Mohammed-Coleman on Thu Jul 09 08:03:49 EDT 2009
  ·  libraries for offline use by Ulf Dittmer on Fri Jul 10 03:12:59 EDT 2009
    ·  Re: libraries for offline use by Faizal Abdoelrahman on Fri Jul 10 04:20:19 EDT 2009
    ·  Re: libraries for offline use by Vinicius Carvalho on Fri Jul 10 13:17:27 EDT 2009
  Message #310807

Re: Combining Google Language API and Lucene

Posted by: Vachon Ulrich on July 08, 2009 in response to Message #310731
Good stuff,

But what if you are offline? I once developed a similar feature that used a neural network. It took inputs built from n-gram fragments of text in any language. Maybe Google works like this?

  Message #310808

Re: Combining Google Language API and Lucene

Posted by: Vinicius Carvalho on July 08, 2009 in response to Message #310807
Yeah, being online is a must. I was thinking of using some sort of classifier for that, naive Bayes for instance. I may still implement it one day. A good thing about Google, though, is the large number of languages supported. I don't think I could find that many documents in different languages to train my classifier.

  Message #310847

Re: Combining Google Language API and Lucene

Posted by: Amin Mohammed-Coleman on July 09, 2009 in response to Message #310731
Cool stuff indeed. I will definitely follow the progress on this!

  Message #310887

libraries for offline use

Posted by: Ulf Dittmer on July 10, 2009 in response to Message #310731
Several libraries for language detection are available that do not require online access, e.g. this one: http://www.jroller.com/melix/entry/nlp_in_java_a_language

  Message #310890

Re: libraries for offline use

Posted by: Faizal Abdoelrahman on July 10, 2009 in response to Message #310887
Yes, building an n-gram model on a corpus and then comparing it to the n-gram frequency distribution of the sentence to be classified works extremely well for language identification, even on very short sentences.

I am surprised that so few software packages, APIs, and frameworks seem to take advantage of this algorithm in the context of i18n.
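
The approach described above can be sketched roughly as follows. This is a toy version under stated assumptions, not the algorithm of any particular library: it uses character trigrams, and it compares profiles with cosine similarity, which is one common choice that the post itself does not specify. The class `NGramLanguageGuesser` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy language guesser based on character trigram counts. */
final class NGramLanguageGuesser {
    private static final int N = 3;
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    /** Counts character trigrams after lowercasing and collapsing non-letters to spaces. */
    static Map<String, Integer> profile(String text) {
        String cleaned = " "
                + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim()
                + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= cleaned.length(); i++) {
            counts.merge(cleaned.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    /** Builds (or replaces) the trigram profile for one language from a corpus. */
    void train(String language, String corpus) {
        profiles.put(language, profile(corpus));
    }

    /** Cosine similarity between two count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) count * other;
            }
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the best-matching trained language, or null if no trigram overlaps. */
    String guess(String sentence) {
        Map<String, Integer> query = profile(sentence);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(query, e.getValue());
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

A caller would invoke `train` once per language with a representative corpus and then call `guess` on each incoming sentence. Real corpora need to be far larger than a sentence or two for the profiles to be reliable.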

  Message #310915

Re: libraries for offline use

Posted by: Vinicius Carvalho on July 10, 2009 in response to Message #310887
Great stuff. One good thing about the Google API is the large number of languages supported :)

All Content Copyright ©2007 TheServerSide