The search engine that TheServerSide used until a year ago was pretty poor. It built its index by spidering the site, which produced poor, out-of-context search results. A year ago we did something about it and implemented a Lucene-based solution. Dion Almaer built that solution and wrote an article that walks through the process and shows how we plugged Lucene into our search.
Using Lucene gave our search relevance, speed, and power. With this approach, we can tweak the way we index and search our content with little effort.
Read I Love Lucene
-
I Love Lucene (on TheServerSide) (34 messages)
- Posted by: Floyd Marinescu
- Posted on: January 12 2005 00:58 EST
Threaded Messages (34)
- good article by thoff thoff on January 13 2005 13:06 EST
- View the source of this page for a clue... by John Reynolds on January 13 2005 15:07 EST
- Tapestry on TSS ! by Vagif Verdi on January 13 2005 15:21 EST
-
How about ... by analog boy on January 14 2005 04:52 EST
- How about ... by Ioan C on January 14 2005 11:23 EST
-
How about ... by Cameron Purdy on January 15 2005 01:32 EST
-
TSS Architecture by Howard Lewis Ship on January 16 2005 09:37 EST
- TSS Architecture by Raymond Xu on January 19 2005 01:44 EST
- How about ... by Floyd Marinescu on January 17 2005 12:27 EST
- Tapestry by Muhammad Mansoor on January 14 2005 05:19 EST
- Re: good article by sachin joshi on August 07 2006 03:10 EDT
- I Love Lucene (on TheServerSide) by Vic Cekvenich on January 13 2005 15:27 EST
- Nutch, Lucene by Jason McKerr on January 13 2005 16:02 EST
- Cool by T Wilson on January 13 2005 16:15 EST
- I Love Lucene (on TheServerSide) by bruce deen on January 13 2005 16:54 EST
- Other formats by Gordon Harding on January 18 2005 16:26 EST
- Other formats by Erik Hatcher on January 19 2005 11:31 EST
- Dion rocks by Erik Hatcher on January 13 2005 16:57 EST
- try searching this site for lucene by Dave C on January 13 2005 17:33 EST
- Re: try searching this site for lucene by Erik Hatcher on January 13 2005 19:23 EST
- Re: try searching this site for lucene by Erik Hatcher on January 14 2005 04:46 EST
- Index storage by Irakli Nadareishvili on January 18 2005 01:47 EST
- Could Lucene fix the page widths on this site? by tr mo on January 18 2005 17:14 EST
- Could Lucene fix the page widths on this site? by Raymond Xu on January 19 2005 01:26 EST
- Red-Piranha - Open Source , Lucene Based Search by paul browne on January 19 2005 08:41 EST
- Incremental build Q by Abhishek Shah on February 22 2005 15:22 EST
- Serializing IndexWriter.addDocument() calls by Branko Milosavljevic on March 02 2005 05:26 EST
- Lucene Search Summaries. by Rob Kenworthy on April 10 2005 03:01 EDT
- Lucene Search Summaries. by Branko Milosavljevic on May 30 2005 04:48 EDT
- oracle text vs Lucene search by seema bhat on August 24 2005 03:02 EDT
- Rankings by Jarrod Cuzens on August 24 2005 06:09 EDT
- Lucene performance and scalability by chunseo choi on September 28 2005 12:00 EDT
- I Love Lucene (on TheServerSide) by vijendra singh on April 24 2006 10:30 EDT
- i need help with lucene formula by fghfgh fghdgf on October 10 2006 18:27 EDT
-
good article
- Posted by: thoff thoff
- Posted on: January 13 2005 13:06 EST
- in response to Floyd Marinescu
Thanks, that was very useful. You mentioned you use JSP for legacy reasons; what would you use now? -
View the source of this page for a clue...
- Posted by: John Reynolds
- Posted on: January 13 2005 15:07 EST
- in response to thoff thoff
TSS switched to Tapestry a while back. -
Tapestry on TSS !
- Posted by: Vagif Verdi
- Posted on: January 13 2005 15:21 EST
- in response to John Reynolds
Wow! Why has nobody brought this to light?
This is so freaking exciting! :)) -
How about ...
- Posted by: analog boy
- Posted on: January 14 2005 04:52 EST
- in response to John Reynolds
If they have, I'd be interested in reading about that part of their architecture too. Are you busy at the moment, Dion? ;)
Seriously, the site does take a respectable number of hits, and it would be interesting to learn more about the architecture beyond the application server config and search engine. -
How about ...
- Posted by: Ioan C
- Posted on: January 14 2005 11:23 EST
- in response to analog boy
On the same subject: how long does it take until a post is indexed? Presumably there is a thread that scans the DB for new inserts. Is there a scheduler behind it? -
How about ...
- Posted by: Cameron Purdy
- Posted on: January 15 2005 13:32 EST
- in response to analog boy
If they have, I'd be interested in reading about that part of their architecture too. Are you busy at the moment, Dion? ;) Seriously, the site does take a respectable number of hits, and it would be interesting to learn more about the architecture beyond the application server config and search engine.
Last I had heard: Tapestry for the UI, Postgres for the DB, Coherence for clustered data caching, running on a heterogeneous cluster of several different app servers.
Peace,
Cameron Purdy
Tangosol, Inc.
Coherence: Shared Memories for J2EE Clusters -
TSS Architecture
- Posted by: Howard Lewis Ship
- Posted on: January 16 2005 09:37 EST
- in response to Cameron Purdy
Tapestry for UI
PostgreSQL for database
Solarmetric KODO for object relational mapping
Tangosol Coherence for cluster wide cache
Lucene for text search
In addition, a significant part of the infrastructure is based on HiveMind; most prominently, the logic that accepts the old-format URLs and does a server-side forward to the corresponding Tapestry pages (which is why nobody noticed the changeover, and that is by design).
Runs on a two-server cluster of WebLogic servers, on Red Hat Linux. The heterogeneous server approach was an experiment a while back.
A short article about the Tapestry transformation is in the queue. -
TSS Architecture
- Posted by: Raymond Xu
- Posted on: January 19 2005 01:44 EST
- in response to Howard Lewis Ship
the logic that accepts the old-format URLs and does a server-side forward to the corresponding Tapestry pages (which is why nobody noticed the changeover, and that is by design).
Guru, this is really cool. But I saw the Tapestry-style link after posting... Anyway, will this feature be part of the next Tapestry release? -
How about ...
- Posted by: Floyd Marinescu
- Posted on: January 17 2005 12:27 EST
- in response to analog boy
Guys, an article on the Tapestry migration is coming up this week. -
Tapestry
- Posted by: Muhammad Mansoor
- Posted on: January 14 2005 05:19 EST
- in response to thoff thoff
Tapestry :) -
Re: good article
- Posted by: sachin joshi
- Posted on: August 07 2006 03:10 EDT
- in response to thoff thoff
Yes, good article. Please let me know how to prepare the query. I want a query like "text:hi And category:Home". Which type of query should I use, and how do I build such queries? Quick help is appreciated. Thanks -
I Love Lucene (on TheServerSide)
- Posted by: Vic Cekvenich
- Posted on: January 13 2005 15:27 EST
- in response to Floyd Marinescu
I love Lucene too.
(My example is RiA using a remote Lucene service. You can search the Struts, JDNC, and Tomcat mailing lists in the dev version at boardVU.com, and you can see the Lucene score of your search. I just made Lucene into a DAO, similar to an iBatis DAO.)
.V -
Nutch, Lucene
- Posted by: Jason McKerr
- Posted on: January 13 2005 16:02 EST
- in response to Floyd Marinescu
The OSL recently worked with Oregon State University's Central Web Services department on replacing its Google appliance with Nutch, which is based on Lucene. OSU gets great results, and the net present value is about $471,000.
We love Lucene too.
Jason McKerr
The Open Source Lab
"Open Minds. Open Doors. Open Source." -
Cool
- Posted by: T Wilson
- Posted on: January 13 2005 16:15 EST
- in response to Floyd Marinescu
I've had some success with Lucene as well. -
I Love Lucene (on TheServerSide)
- Posted by: bruce deen
- Posted on: January 13 2005 16:54 EST
- in response to Floyd Marinescu
I've had excellent results using Lucene for my organization. People love the fact that I can add functionality for almost any document: PDF, Word, Excel... I like the fact that it's fast and easy to use, whether over multiple indexes or over one, and the query language is fantastic. I've even used it via an internal web service for a Delphi app that lives on top of a file system. The search is so much more advanced than the Windows search, it's unbelievable. -
Other formats
- Posted by: Gordon Harding
- Posted on: January 18 2005 16:26 EST
- in response to bruce deen
I've had excellent results using Lucene for my organization. People love the fact that I can add functionality for almost any document: PDF, Word, Excel... I like the fact that it's fast and easy to use, whether over multiple indexes or over one, and the query language is fantastic. I've even used it via an internal web service for a Delphi app that lives on top of a file system. The search is so much more advanced than the Windows search, it's unbelievable.
How are you able to read formats such as PDF and Excel? Is there technology built into Lucene, or is that external? -
Other formats
- Posted by: Erik Hatcher
- Posted on: January 19 2005 11:31 EST
- in response to Gordon Harding
How are you able to read formats such as PDF and Excel? Is there technology built into Lucene, or is that external?
Lucene deals with text, and text only. It is the developer's responsibility to parse other formats into text before handing them to Lucene. There are a number of third-party libraries available to make this easy, though. Chapter 7 of Lucene in Action covers how to integrate many of these libraries. PDFBox does a great job with PDF files. POI works with Excel. TextMining works with Word. These are just some of the options available. The source code distribution of Lucene in Action has an easily runnable example of parsing various file formats (see the README) and a slick extensible framework that can be used to abstract file handling details. -
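To illustrate the "extract text first, then index" split described above, here is a minimal sketch in plain Java. It is not the framework from the Lucene in Action source distribution: `TextExtractor` and `ExtractorRegistry` are hypothetical names, and a real PDF or Excel extractor would wrap a library such as PDFBox or POI behind the same interface.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical extractor: turns raw file bytes into plain text for indexing. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Hypothetical registry that picks an extractor by file extension. */
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for one extension; matching is case-insensitive. */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extracted text, or empty if no extractor handles this file type. */
    Optional<String> extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        return extractor == null
                ? Optional.empty()
                : Optional.of(extractor.extract(content));
    }
}
```

A plain-text handler could be registered as `registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8))`; whatever text comes back is what would then be handed to the indexer.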
Dion rocks
- Posted by: Erik Hatcher
- Posted on: January 13 2005 16:57 EST
- in response to Floyd Marinescu
Otis and I were honored to have Dion's case study contributed to "Lucene in Action".
Can anyone spot an issue in his code? We footnoted it in our book - sorry Dion! :)
http://www.lucenebook.com/search?query=tss%20issue
A quick note about our lucenebook.com site: we decided to have some fun and put up something that is actually useful to owners of our book and that also shows non-owners the value of Lucene. It's a blog integrated with a "search inside" the book contents feature. The Table of Contents page is dynamic: blog entries that refer to particular sections automatically appear in the right place (currently there are two cosmetic errata items there).
And, lucenebook.com is Tapestry too - can't you tell from the URLs? :) The blog is blojsom, but the other pages (search results and TOC currently) are Tapestry pages. -
try searching this site for lucene
- Posted by: Dave C
- Posted on: January 13 2005 17:33 EST
- in response to Floyd Marinescu
This particular page doesn't even show up on the first page of results. Hmmm... -
Re: try searching this site for lucene
- Posted by: Erik Hatcher
- Posted on: January 13 2005 19:23 EST
- in response to Dave C
This particular page doesn't even show up on the first page of results. Hmmm...
I bet it will tomorrow. I believe TSS indexes nightly. -
Re: try searching this site for lucene
- Posted by: Erik Hatcher
- Posted on: January 14 2005 04:46 EST
- in response to Erik Hatcher
I bet it will tomorrow. I believe TSS indexes nightly.
And sure enough, a search for "lucene" shows the article as the 2nd result at the time of writing. -
Index storage
- Posted by: Irakli Nadareishvili
- Posted on: January 18 2005 01:47 EST
- in response to Floyd Marinescu
Wonderful article!
But something is missing :)
Look at this:
<!-- The path to where the search index is kept -->
<index-location windows="/temp/tss-searchindex" unix="/tss/searchindex" />
Indexes are kept on the filesystem, yet (from what we know) TSS has a clustered architecture. So how does that work? Is there a dedicated server for search? Or are the indexes on a shared network drive?
Managing Lucene indexes in a distributed application is an interesting topic. I know there is a Lucene-DB package, but 1) it needs tweaking; it's no Hibernate that you can just plug into any database and have ready to use, and 2) I think storing indexes in a DB may cause unnecessary performance degradation. Looking at the Lucene-DB code itself, all it does is simulate a filesystem in a database, creating a layer that makes Lucene think it is still working with files. But resources are spent on the network connection and the DB query while doing this, so...
What did TSS do? Any advice, maybe from Erik? I read his book but did not find an answer to this (maybe I missed it; I only had time to skim). -
Could Lucene fix the page widths on this site?
- Posted by: tr mo
- Posted on: January 18 2005 17:14 EST
- in response to Floyd Marinescu
This web site won't fit on a 1024x760 screen. Does Jakarta have something to fix that? Maybe common sense? -
Could Lucene fix the page widths on this site?
- Posted by: Raymond Xu
- Posted on: January 19 2005 01:26 EST
- in response to tr mo
This web site won't fit on a 1024x760 screen. Does Jakarta have something to fix that? Maybe common sense?
Bro, this site fits well on my 1024x768 notebook screen, with a browser called Firefox. -
Red-Piranha - Open Source , Lucene Based Search
- Posted by: paul browne
- Posted on: January 19 2005 08:41 EST
- in response to Floyd Marinescu
Great article, guys - I wish you had written it earlier, as it is very useful!
A Lucene-based open source project that broadly follows the outline discussed in the article is Red-Piranha - http://red-piranha.sourceforge.net/. It uses Spring as its MVC framework, and can run via a Servlet (Tomcat), a GUI, or scripted via the command line.
Some suggested uses include a web site or intranet search engine, integration into your J2EE development project, and knowledge and document management.
As well as the search functionality, Red-Piranha also includes the ability to 'learn' what the user wants, and so improve future searches. -
Incremental build Q
- Posted by: Abhishek Shah
- Posted on: February 22 2005 15:22 EST
- in response to Floyd Marinescu
The "I Love Lucene" article stated:
"Indexing our data is so fast, that we don’t even need to run the incremental build plan that we developed. At one point we mistakenly had an IndexWriter.optimize() call every time we added a document. When we relaxed that to run less frequently we brought down the index time to a matter of seconds. It used to take a LOT longer, even as long as 45 minutes." Could someone elaborate on this? So there is no incremental build plan? How does one take care of defunct links? -
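The difference the quoted passage describes, optimizing after every add versus once per run, can be sketched in plain Java. This is only an illustration under stated assumptions: `IndexSink` is a hypothetical stand-in for the two writer operations named in the quote, not Lucene's `IndexWriter`, and `CountingSink` exists only to count calls.

```java
import java.util.List;

/** Hypothetical stand-in for the two index-writer operations the quote mentions. */
interface IndexSink {
    void addDocument(String doc);
    void optimize();
}

class BatchIndexer {
    /** Adds every document, then optimizes once at the end instead of after each add. */
    static void rebuild(IndexSink sink, List<String> docs) {
        for (String doc : docs) {
            sink.addDocument(doc);
        }
        sink.optimize();
    }
}

/** Counting sink used only to show how many times each operation runs. */
class CountingSink implements IndexSink {
    int adds;
    int optimizes;

    @Override
    public void addDocument(String doc) {
        adds++;
    }

    @Override
    public void optimize() {
        optimizes++;
    }
}
```

With three documents, `BatchIndexer.rebuild` performs three adds and a single optimize. On the defunct-links question, one general observation (not a statement about what TSS does): an index rebuilt from scratch contains only what is still in the data source at rebuild time, so removed content simply does not reappear.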
Serializing IndexWriter.addDocument() calls
- Posted by: Branko Milosavljevic
- Posted on: March 02 2005 05:26 EST
- in response to Floyd Marinescu
The article doesn't explain how you managed to serialize the IndexWriter.addDocument() method calls. AFAIK, these calls must be serialized (and probably invoked in a single thread) in order to keep the Lucene index consistent. Invoking the Lucene indexer inside an MDB sounds like a good idea at first, but I'm wondering how to ensure this serialization. -
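One generic way to get the serialization asked about here, independent of what Lucene itself guarantees and of whatever TSS actually did, is to funnel all updates through a single worker thread. The sketch below is plain Java under that assumption; `SerializedIndexer` is a hypothetical name, and the in-memory list stands in for a real index writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Funnels index updates from many caller threads through one worker thread. */
class SerializedIndexer {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Touched only by the worker thread, so it needs no locking of its own.
    private final List<String> index = new ArrayList<>();

    /** Queues a document; the single worker applies queued adds one at a time, in order. */
    Future<?> add(String doc) {
        return worker.submit(() -> { index.add(doc); });
    }

    /** Waits for all previously queued adds, shuts the worker down, returns the document count. */
    int close() {
        try {
            Callable<Integer> count = index::size;
            // Runs on the worker after every add queued before it.
            return worker.submit(count).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }
}
```

Callers such as message-driven beans would only ever call `add`, which returns immediately; the ordering and single-threaded access come from the executor's queue, not from the callers.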
Lucene Search Summaries.
- Posted by: Rob Kenworthy
- Posted on: April 10 2005 03:01 EDT
- in response to Branko Milosavljevic
Hi all,
I also love Lucene... but there is still quite a bit of code that a developer has to put together to build a web site. Currently, I am building a search facility for jsourcery.com using Lucene. A problem that I have more or less just overcome is providing sensible looking summaries for search results. Has anybody else had the same frustration?
To elaborate, I wanted Google-esque search result summaries that display segments of the search result text in bold. Furthermore, I wanted to show the most relevant parts of the text, so that a reader can tell whether a particular search result is useful without actually having to look at the page. From searching the web, I have found no library that does this for me. So I am now finalizing some code that uses a clustering algorithm to identify the most interesting segments of a search result and lump them together into a summary.
It seems to me that it would be natural for this functionality to be included in the Lucene distribution, but according to Google, it doesn't seem like much of a hot topic.
Has anyone else had to do the same as me or is there such a library out there that I have missed? -
Lucene Search Summaries.
- Posted by: Branko Milosavljevic
- Posted on: May 30 2005 04:48 EDT
- in response to Rob Kenworthy
Check out Lucene Sandbox: http://lucene.apache.org/java/docs/lucene-sandbox/
There is a term highlighter module. -
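For readers who want to see the basic idea behind such summaries, here is a minimal keyword-in-context sketch in plain Java. It does not use Lucene or the sandbox highlighter; the class `Summarizer`, the method `summarize`, and the `radius` parameter are made up for illustration, and it ignores word boundaries, HTML escaping, and the clustering of segments described earlier in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Summarizer {
    /**
     * Returns up to `radius` characters on either side of the first hit for
     * `term`, with every hit inside that window wrapped in <b> tags.
     * Returns an empty string when the term does not occur.
     */
    static String summarize(String text, String term, int radius) {
        // Quote the term so it is matched literally, ignoring case.
        Pattern hit = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher first = hit.matcher(text);
        if (!first.find()) {
            return "";
        }
        int start = Math.max(0, first.start() - radius);
        int end = Math.min(text.length(), first.end() + radius);
        String window = text.substring(start, end);
        // $0 is the matched text, so the original capitalization is kept.
        String bolded = hit.matcher(window).replaceAll("<b>$0</b>");
        return (start > 0 ? "..." : "") + bolded + (end < text.length() ? "..." : "");
    }
}
```

For example, `Summarizer.summarize("I love Lucene a lot", "lucene", 5)` returns `...love <b>Lucene</b> a lo...`.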
oracle text vs Lucene search
- Posted by: seema bhat
- Posted on: August 24 2005 15:02 EDT
- in response to Rob Kenworthy
I would like to know what advantages Lucene has over the advanced features of Oracle 9i Text.
My application manages a huge amount of content and assets, up to 2TB in size now, and the db size is growing very fast. We plan to use Oracle9i interMedia to store assets (Word, PDF, pt, xls, jpg, eps, gif files, etc.). The db stores assets for each feature film title, key art, stills, etc. The content is presently in English and Spanish. The application also contains other asset sections and some modules such as news, press releases, etc.
The user can do an advanced search in any of these sections. Users should only get search results they have permission for (for example news only, and not press). Users should see search results in their preferred language (when content for that language exists). Users can get access to the content or the asset files (including image, video, PDF). The app server is WebSphere in a clustered environment.
For this data volume, which search should I be using: Lucene or Oracle? Can I get a comparison in terms of performance, usability, and scalability, considering future db growth?
What are the best practices or guidelines for implementing such a sitewide search? -
Rankings
- Posted by: Jarrod Cuzens
- Posted on: August 24 2005 06:09 EDT
- in response to Floyd Marinescu
What I am really interested in is how the rankings were accomplished. I have been struggling myself with getting the document field rankings to work properly. In the article the author mentions how they created boosts for the dates but there is no code etc. to explain this. Does anybody know how this works? TIA -
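The article's actual date boosting is not shown anywhere in this thread, so the following is only a hypothetical sketch of what a recency boost could look like: a linear decay from `1 + maxExtra` for a document posted today down to `1.0` at `horizonDays`. The class, the method, and both parameters are assumptions. In Lucene releases of that era, a value like this could be applied per document with `Document.setBoost(float)` before indexing; whether TSS did exactly that is not stated here.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class RecencyBoost {
    /**
     * Linear decay: a document dated today gets 1 + maxExtra, one that is
     * horizonDays old or older gets 1.0, and anything in between is interpolated.
     */
    static float boostFor(LocalDate posted, LocalDate today, long horizonDays, float maxExtra) {
        // Treat dates in the future as "posted today".
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(posted, today));
        float freshness = Math.max(0f, 1f - (float) ageDays / horizonDays);
        return 1f + maxExtra * freshness;
    }
}
```

With `horizonDays = 100` and `maxExtra = 1`, a post from today gets a boost of 2.0, a 50-day-old post gets 1.5, and anything older than 100 days gets the neutral 1.0. A linear curve is just the simplest choice; a stepped or exponential decay would fit the same signature.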
Lucene performance and scalability
- Posted by: chunseo choi
- Posted on: September 28 2005 12:00 EDT
- in response to Floyd Marinescu
How many documents are being indexed for TSS search?
I'm thinking of using Lucene for a project with over 10 million documents.
I'm concerned with Lucene's indexing and searching performance. Does anyone have experience using Lucene with large index files? -
I Love Lucene (on TheServerSide)
- Posted by: vijendra singh
- Posted on: April 24 2006 10:30 EDT
- in response to Floyd Marinescu
I am interested in using Lucene for search in my application, and I am very new to all this. If I have to use a database instead of the file system, is fetching the data directly enough? I took a deployable .war file from "www.dbsight.net"; it uses queries to fetch data from the database and then builds an index from that data for searching. In the end that sample war gave no output, because it had problems deploying in the Tomcat server, so I don't know what kind of query I should pass to get output.
My requirement is that data should be returned whenever a similar query or word is given for searching (perhaps by automatically appending something like "select * from table where nam like "%s%"").
Please, if possible, give me the exact picture so that I can benefit from this too. Many people have recommended Lucene.
If any more clarification is required, you are welcome to ask.
Thanks
Vijendra -
i need help with lucene formula
- Posted by: fghfgh fghdgf
- Posted on: October 10 2006 18:27 EDT
- in response to Floyd Marinescu
Hi, I am a student in Rabat working on search engines, and I want to understand the Lucene formula: http://www.lucenebook.com/blog/errata/2005/01/24/ Thanks for your help. You can contact me here. Thanks.