Java Development News:

Lucene In Action

By Joaquin Delgado

01 Mar 2005 | TheServerSide.com

This is a well-written book in both conception and execution. The authors have taken Lucene, a mature open-source tool, together with a set of ideas worked out by its user and developer community over the past 4+ years, and crystallized them into a sustained and consistent presentation. This is not a collection of technical articles and FAQs about Lucene but a completely new volume, free of the repetitions and notational inconsistencies one might expect from a book assembled out of bits and pieces. Otis and Erik, who are renowned Lucene experts and project committers, have synthesized and conveyed the technical expertise, dedication and work of a community of world-class engineers, starting with Doug Cutting, who have graciously contributed their code and knowledge to the Lucene project, which has once again proven the power of open source.

The authors review and condense a wide range of Lucene’s core material, including indexing, searching, analysis (tokenization), sorting/filtering, span queries, term vectors and other advanced search techniques; as well as applied Lucene material, such as document parsing, tools and extensions, Lucene ports and case studies, to put together the most comprehensive and authoritative guide to Lucene ever published.

The topics are introduced in an iterative, problem-driven manner: a problem is presented, followed by the concepts and methods that overcome it; these may introduce new problems, which are solved in turn, and so on. In this way, readers are naturally drawn along the logical path that leads to the discovery of Lucene and are exposed to its advantages and beauty along the way.

This book is aimed primarily at readers interested in information retrieval and in using Lucene to build search applications, but it is also a great reading exercise for computer scientists, students and programmers in general. Through the concepts and code presented in this book, the reader will learn to appreciate the power of using (software) building blocks and the effectiveness of learning from examples, all while keeping in mind the principle that simple designs and solutions for complex “real-world” problems usually work best.

The book is composed of two very distinct parts, with a total of ten illustrative chapters:

Part 1, Core Lucene (chapters 1 through 6), gives a gentle introduction to Lucene and its core classes and explores the world of indexing and searching made simple. The heart of Lucene is then exposed through the use of the API for the analysis, storage, manipulation, scoring and retrieval of text (documents and fields) as part of the core Lucene library. The authors progressively present examples of use, ranging from the basics to the most advanced techniques now available in Lucene 1.4. I found chapters 4 (Analysis) and 5 (Advanced search techniques) the most groundbreaking and particularly interesting from a problem-solving standpoint, since most of the functionality and specific solutions described in these chapters are attempts to solve specific and sometimes hard-to-solve problems raised by the developer community.

Part 2, Applied Lucene (chapters 7 through 10), explores document preprocessing, the Lucene sandbox and different ports of Lucene, concluding with a fascinating set of case studies. Once you have read the initial chapters of Part 1, and if you have some Lucene experience, this is probably the place to go to understand how other people have gone about using Lucene for their particular applications.
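For readers who have not met the analysis, indexing and searching steps that Part 1 revolves around, the general idea can be sketched in a few lines of plain Java. This is only an illustration of an inverted index, not code from the book and not the Lucene API: the class TinyInvertedIndex and its methods are hypothetical names, the analyzer is a naive lowercase ASCII tokenizer, and there is no scoring at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's implementation or API.
class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // "Analysis": lowercase the text and split it on anything
    // that is not an ASCII letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Indexing": record every term of the document and return its id.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "Searching": ids of the documents containing every query term (AND).
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so postings stay intact
            } else {
                result.retainAll(docs);         // intersect with the next term
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene in Action");
        index.add("Searching and indexing with Lucene");
        index.add("Information retrieval basics");

        System.out.println(index.search("lucene"));          // prints [0, 1]
        System.out.println(index.search("lucene indexing")); // prints [1]
        System.out.println(index.search("missing"));         // prints []
    }
}
```

The same text analysis is applied to documents and to queries, which is why a query for "lucene" matches a document containing "Lucene"; a real search library adds stored fields, scoring, richer query types and on-disk index structures on top of this basic idea.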

Throughout the book there are clear and practical code examples that show how and where the Lucene API is used. Each step is highlighted, numbered and explained in detail. The simplicity of the examples and of the corresponding explanations makes the book an easy and understandable read, even for non-Java programmers.

Although extremely well written, the book is not without weaknesses. As with any publication there is room for improvement, though such judgments are always subjective. Here are my comments:

The authors try to make the case for Test Driven Development and Test Driven Learning in the section “about this book”, but it is unclear why they compromise the readability of the example code by imposing the JUnit framework and the use of assertions, which adds an extra layer of complexity. Although it is indisputable that development and testing must go together, it is common practice to leave the final code clear of debugging and testing statements, primarily so that it can be readily understood and reused as-is by other developers. This is all the more true when the source code is available for download. Perhaps two versions, one with test cases and one without, should be prepared, at least for the book’s additional material.

Although it is probably the subject of a different book, metrics of indexing and search performance for individual Lucene indexes were essentially omitted: how performance relates to the number and volume of documents, and what hardware that implies given the core data structures and their memory and I/O requirements. (Some real-life hardware specs are mentioned in the case studies.) I believe this is a central issue for the design and actual deployment of any scalable search architecture based on Lucene and thus deserves special attention. I know that providing the right architecture and specifying the basic requirements for an application is a science in itself, but some guidelines would be extremely helpful for people seriously thinking of incorporating Lucene into their projects and products.

I also feel that more direct references to basic Information Retrieval (IR) books, tutorials and papers are missing from the resource section (Appendix C). For example, Doug Cutting’s publications, which are listed, are useful but perhaps too advanced for readers who just want to expand their knowledge of the common theory behind Lucene and other search engines.

Finally, a list of figures, tables, equations and listings would be useful if this book is to be used as a reference guide.

In conclusion, this book is a must-read for anyone who wants to learn about Lucene, is considering embedding search into their applications, or just wants to learn about information retrieval in general. Highly recommended!

I enjoyed reading it myself and was pleasantly surprised by the quality of its content and editorial layout. I also think Manning was the right choice to publish this book, reaffirming once again its position in the applied computer science technical publication space.

Biography

Joaquin Delgado - Chief Technology Officer of TripleHop Technologies Inc., a New York-based context-sensitive enterprise search company. Dr. Delgado is a specialist in information filtering and retrieval on the Web, with a background in multi-agent systems, collaborative filtering/recommender systems, semantic networks (ontologies) and statistical NLP; he has published extensively on these topics and has been a guest speaker at numerous academic and industry conferences in the U.S. and around the world. During his time in Japan (1994-2000) he provided consulting services to Telematrix and Linc Media Inc. in Tokyo, handling and supervising complex Internet projects for companies pioneering in e-commerce and voice over IP. Prior to moving to Japan, Dr. Delgado worked for Editora Ferga in Caracas, Venezuela, where he conceived and designed the first Venezuelan export search and directory Web site. He also worked at Oracle, where he developed extensive database design and implementation expertise. Dr. Delgado holds a PhD in computer science and artificial intelligence from Nagoya Institute of Technology, Japan, and a degree in computer engineering from Universidad Simon Bolivar, Caracas, Venezuela.