I Love Lucene (on TheServerSide)

  1. I Love Lucene (on TheServerSide) (34 messages)

    The search engine that TheServerSide used a year ago was pretty poor. It built its index by spidering the site, which produced poor, out-of-context search results. So we did something about it and implemented a Lucene-based solution. Dion Almaer built that solution and wrote an article that walks through the process and shows how we plugged it into our search.

    With this approach our search gained relevance, speed, and power, and we can tweak the way we index and search our content with little effort.

    Read I Love Lucene

    Threaded Messages (34)

  2. good article

    Thanks, that was very useful. You mentioned you use JSP for legacy reasons; what would you use now?
  3. TSS switched to Tapestry a while back.
  4. Tapestry on TSS!

    Wow! Why has nobody brought this to light?
    This is so freaking exciting! :))
  5. How about ...

    If they have, I'd be interested in reading about that part of their architecture too. Are you busy at the moment, Dion? ;)

    Seriously, the site does take a respectable number of hits, and it would be interesting to learn more about the architecture beyond the application server config and search engine.
  6. How about ...

    On the same subject: how long does it take until a post is indexed? Presumably a thread scans the DB for new inserts. Is there a scheduler behind it?
  7. How about ...

    If they have, I'd be interested in reading about that part of their architecture too. Are you busy at the moment, Dion? ;) Seriously, the site does take a respectable number of hits, and it would be interesting to learn more about the architecture beyond the application server config and search engine.

    Last I heard: Tapestry for the UI, Postgres for the DB, Coherence for clustered data caching, running on a heterogeneous cluster of several different app servers.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Shared Memories for J2EE Clusters
  8. TSS Architecture

    Tapestry for UI
    PostgreSQL for database
    Solarmetric KODO for object relational mapping
    Tangosol Coherence for cluster wide cache
    Lucene for text search

    In addition, a significant part of the infrastructure is based on HiveMind, most prominently the logic that accepts the old-format URLs and does a server-side forward to the corresponding Tapestry pages (which is why nobody noticed the changeover; that is by design).

    Runs on a two-server cluster of WebLogic servers, on RedHat Linux. The heterogeneous server approach was an experiment a while back.

    A short article about the Tapestry transformation is in the queue.
  9. TSS Architecture

    the logic that accepts the old-format URLs and does a server-side forward to the corresponding Tapestry pages (which is why nobody noticed the changeover; that is by design).

    Guru, this is really cool. But I saw the Tapestry-style link after posting... Anyway, will this feature be part of the next Tapestry release?
  10. How about ...

    Guys, an article on the Tapestry migration is coming up this week.
  11. Tapestery

    Tapestry :)
  12. Re: good article

    Yes, good article. Please let me know how to prepare the query. I want a query like "text:hi And category:Home". Which type of query should I use, and how do I prepare such queries? Quick help is appreciated. Thanks
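
Editor's note: in Lucene's query parser syntax the boolean operators are written in uppercase (AND, OR, NOT), so the query asked about would read `text:hi AND category:Home`. The sketch below only assembles such a string from field/term pairs; the class and method names are hypothetical and it does not call Lucene itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FieldQuerySketch {
    /** Joins field/term pairs into "field:term AND field:term" (hypothetical helper). */
    public static String allOf(Map<String, String> clauses) {
        StringJoiner joined = new StringJoiner(" AND ");
        for (Map.Entry<String, String> clause : clauses.entrySet()) {
            String term = clause.getValue();
            // Quote multi-word values so they are read as a single phrase.
            if (term.contains(" ")) {
                term = "\"" + term + "\"";
            }
            joined.add(clause.getKey() + ":" + term);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> clauses = new LinkedHashMap<>();
        clauses.put("text", "hi");
        clauses.put("category", "Home");
        System.out.println(allOf(clauses)); // prints: text:hi AND category:Home
    }
}
```

The resulting string can be handed to Lucene's `QueryParser`; the same query can also be built programmatically by combining `TermQuery` instances in a `BooleanQuery`.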
  13. I Love Lucene (on TheServerSide)

    I love Lucene too.

    (My example is RiA using a remote Lucene service. You can search the Struts, JDNC, and Tomcat mailing lists, dev version at boardVU.com, and see the Lucene score of your search. I just made Lucene into a DAO, similar to an iBatis DAO.)


    .V
  14. Nutch, Lucene

    The OSL recently worked with Oregon State University's Central Web Services department to replace its Google appliance with Nutch, which is based on Lucene. OSU gets great results, and the Net Present Value is about $471,000.

    We love Lucene too.

    Jason McKerr
    The Open Source Lab
    "Open Minds. Open Doors. Open Source."
  15. Cool

    I've had some success with Lucene as well.
  16. I Love Lucene (on TheServerSide)

    I've had excellent results using Lucene for my organization. People love the fact that I can add functionality for almost any document: PDF, Word, Excel... I like the fact that it's fast and easy to use, whether over multiple indexes or over one, and the query language is fantastic. I've even used it via an internal web service for a Delphi app that lives on top of a file system. The search is so much more advanced than the Windows search, it's unbelievable.
  17. Other formats

    I've had excellent results using Lucene for my organization. People love the fact that I can add functionality for almost any document: PDF, Word, Excel... I like the fact that it's fast and easy to use, whether over multiple indexes or over one, and the query language is fantastic. I've even used it via an internal web service for a Delphi app that lives on top of a file system. The search is so much more advanced than the Windows search, it's unbelievable.

    How are you able to read formats such as PDF and Excel? Is there technology built in to Lucene or is that external?
  18. Other formats

    How are you able to read formats such as PDF and Excel? Is there technology built in to Lucene or is that external?

    Lucene deals with text, and text only. It is the developer's responsibility to parse other formats before handing the text to Lucene. There are a number of third-party libraries available to make this easy, though. Chapter 7 of Lucene in Action covers how to integrate many of these libraries. PDFBox does a great job with PDF files. POI works with Excel. TextMining works with Word. These are just some of the options available. The source code distribution of Lucene in Action has an easily runnable example of parsing various file formats (see the README) and a slick extensible framework that can be used to abstract file handling details.
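
Editor's note: the idea of abstracting file handling behind one interface can be sketched as a registry that maps file extensions to text extractors. This is not the framework from Lucene in Action; the class name and methods below are hypothetical, and only a plain-text extractor is registered so the example stays free of third-party libraries.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class TextExtractorRegistry {
    // Maps a lowercase file extension to a function that turns raw bytes into plain text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    public void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks an extractor by file extension and returns the text to hand to the indexer. */
    public String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = extractors.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.apply(content);
    }

    public static void main(String[] args) {
        TextExtractorRegistry registry = new TextExtractorRegistry();
        // Plain text needs no parsing; a real setup would register PDF, Excel, and Word parsers here.
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        System.out.println(registry.extract("notes.TXT", "hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

With this shape, the indexing code only ever sees plain strings, and support for a new format means registering one more function.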
  19. Dion rocks

    Otis and I were honored to have Dion's case study contributed to "Lucene in Action".

    Can anyone spot an issue in his code? We footnoted it in our book - sorry Dion! :)

    http://www.lucenebook.com/search?query=tss%20issue

    A quick note about our lucenebook.com site: we decided to have some fun and put up something useful for owners of our book, while also showing non-owners the value of Lucene. It's a blog integrated with a "search inside the book" feature. The Table of Contents page is dynamic: blog entries that refer to particular sections automatically appear in the right place (currently two cosmetic errata items there).

    And, lucenebook.com is Tapestry too - can't you tell from the URLs? :) The blog is blojsom, but the other pages (search results and TOC currently) are Tapestry pages.
  20. try searching this site for lucene

    This particular page doesn't even show up on the first page of results. Hmmm...
  21. Re: try searching this site for lucene

    This particular page doesn't even show up on the first page of results. Hmmm...

    I bet it will tomorrow. I believe TSS indexes nightly.
  22. I bet it will tomorrow. I believe TSS indexes nightly.

    And sure enough, a search for "lucene" shows the article as the 2nd result at the time of writing.
  23. Index storage

    Wonderful article!

    But something is missing :)

    Look at this:
    <!-- The path to where the search index is kept -->
    <index-location windows="/temp/tss-searchindex" unix="/tss/searchindex" />

    Indexes are kept on the filesystem, yet (from what we know) TSS is a clustered architecture. So, how does that work? Is there a dedicated server for the search? Or are the indexes on a shared network drive?

    Managing Lucene indexes in a distributed application is an interesting topic. I know there is a Lucene-DB package, but 1) it needs tweaking; it's no Hibernate that you just plug into any database and use, and 2) I think storing indexes in a DB may be an unnecessary performance hit. Looking at the Lucene-DB code itself, all it does is simulate a filesystem in a database, creating a layer that makes Lucene think it still works with files. But resources are spent on the network connection and DB queries while doing this, so...

    What did TSS do? Any advice, maybe, from Erik? I read his book but did not find an answer to this (maybe I missed it; I only had time to skim).
  24. This web site won't fit on 1024x760 screen. Does Jakarta have something to fix that? Maybe common sense?
  25. This web site won't fit on 1024x760 screen. Does Jakarta have something to fix that? Maybe common sense?

    Bro, this site fits well on my 1024x768 notebook screen, with a browser called Firefox.
  26. Great Article guys - I wish you had written it earlier as it is very useful!

    A Lucene-based open source project that broadly follows the outline discussed in the article is Red-Piranha: http://red-piranha.sourceforge.net/. It uses Spring as its MVC framework, and can run via Servlet (Tomcat), a GUI, or scripted from the command line.

    Some suggested uses include web site or intranet search engine, integration into your J2EE development project, and knowledge and document management.

    As well as the search functionality, Red-Piranha also includes the ability to 'learn' what the user wants, and so improve future searches.
  27. Incremental build Q

    The "I Love Lucene" article stated:

    "Indexing our data is so fast, that we don’t even need to run the incremental build plan that we developed. At one point we mistakenly had an IndexWriter.optimize() call every time we added a document. When we relaxed that to run less frequently we brought down the index time to a matter of seconds. It used to take a LOT longer, even as long as 45 minutes."

    Could someone elaborate on this? So there is no incremental build plan? How does one take care of defunct links?
  28. The article doesn't explain how you managed to serialize the IndexWriter.addDocument() method calls. AFAIK, these calls must be serialized (and probably invoked in a single thread) in order to have a consistent Lucene index. Invoking the Lucene indexer inside an MDB sounds like a good idea at first, but I'm wondering how to ensure this serialization.
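
Editor's note: the two points raised above (optimizing less often than once per document, and keeping index writes serialized) can be illustrated together. The sketch below funnels every write through a single-threaded executor and optimizes once per batch. It is not TSS's code; the `Index` interface is a hypothetical stand-in rather than Lucene's API, and whether a given Lucene version actually requires callers to serialize writes is worth checking against its documentation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SerializedIndexer implements AutoCloseable {
    /** Hypothetical stand-in for the index; not Lucene's actual API. */
    public interface Index {
        void addDocument(String doc);
        void optimize();
    }

    // A single worker thread means queued writes run one at a time, in submission order.
    private final ExecutorService writerThread = Executors.newSingleThreadExecutor();
    private final Index index;
    private final int optimizeEvery;
    private int pending = 0; // touched only on the writer thread

    public SerializedIndexer(Index index, int optimizeEvery) {
        this.index = index;
        this.optimizeEvery = optimizeEvery;
    }

    /** Safe to call from any thread; the write itself runs on the single writer thread. */
    public Future<?> submit(String doc) {
        return writerThread.submit(() -> write(doc));
    }

    private void write(String doc) {
        index.addDocument(doc);
        // Optimize once per batch instead of after every document.
        if (++pending >= optimizeEvery) {
            index.optimize();
            pending = 0;
        }
    }

    /** Drains queued writes, optimizes any leftover partial batch, then stops the thread. */
    @Override
    public void close() throws InterruptedException {
        writerThread.submit(() -> {
            if (pending > 0) {
                index.optimize();
                pending = 0;
            }
        });
        writerThread.shutdown();
        writerThread.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

A message-driven bean or any other concurrent caller would only call `submit`, so the ordering guarantee lives in one place instead of being spread across callers.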
  29. Lucene Search Summaries.

    Hi all,

    I also love Lucene... but there is still quite a bit of code that a developer has to put together to build a web site. Currently, I am building a search facility for jsourcery.com using Lucene. A problem that I have more or less just overcome is providing sensible looking summaries for search results. Has anybody else had the same frustration?

    To elaborate, I wanted Google-esque search result summaries that display the matching segments of the result text in bold. Furthermore, I wanted to show the most relevant parts of the text, so that one can tell whether a particular search result is useful without actually having to look at the page. From searching the web, I have found no library that does this for me. So, I am now finalizing some code that uses a clustering algorithm to identify the most interesting segments of a search result and lump them together into a summary.

    It seems to me that it would be natural for this functionality to be included in the Lucene distribution, but according to Google, it doesn't seem like much of a hot topic.

    Has anyone else had to do the same as me or is there such a library out there that I have missed?
  30. Check out Lucene Sandbox: http://lucene.apache.org/java/docs/lucene-sandbox/

    There is a term highlighter module.
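
Editor's note: for readers who want to see the basic idea behind such summaries, here is a minimal sketch. It is neither the poster's clustering approach nor the Sandbox highlighter's algorithm; it simply slides a fixed-size window over the words, keeps the first window with the most query-term hits, and wraps the hits in bold tags. All names are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

public class SnippetSketch {
    /** Returns the window of `size` words with the most query-term hits, hits wrapped in <b>. */
    public static String summarize(String text, Set<String> queryTerms, int size) {
        String[] words = text.trim().split("\\s+");
        int window = Math.min(size, words.length);
        int bestStart = 0;
        int bestHits = -1;
        // Score every window; on ties the earliest window wins.
        for (int start = 0; start + window <= words.length; start++) {
            int hits = 0;
            for (int i = start; i < start + window; i++) {
                if (isHit(words[i], queryTerms)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                bestStart = start;
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = bestStart; i < bestStart + window; i++) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(isHit(words[i], queryTerms) ? "<b>" + words[i] + "</b>" : words[i]);
        }
        return out.toString();
    }

    private static boolean isHit(String word, Set<String> queryTerms) {
        // Compare case-insensitively and ignore leading or trailing punctuation.
        String normalized = word.toLowerCase(Locale.ROOT).replaceAll("^\\W+|\\W+$", "");
        return queryTerms.contains(normalized);
    }
}
```

The query terms are expected in lowercase. A production summary would also need HTML escaping and probably several windows joined with ellipses, which is where a dedicated highlighter earns its keep.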
  31. Oracle Text vs Lucene search

    I would like to know what advantages Lucene has over the advanced features of Oracle 9i Text.
    My application manages a large amount of content and assets, currently up to 2TB, and the DB is growing very fast. We plan to use Oracle9i interMedia to store assets (word, pdf, pt, xls, jpg, eps, gif files, etc.). The DB stores assets for each feature film title, key art, stills, etc. The content is presently in English and Spanish. The application also contains other asset sections and some modules like news, press releases, etc.
    The user can do an advanced search in any of these sections. Search results should be limited by the user's permissions (for example, news only and not press). The user should see search results in their preferred language (when content for that language exists). The user can get access to the content or the asset files (including image, video, PDF). The app server is WebSphere in a clustered environment.

    For this data volume, which search should I be using: Lucene or Oracle? Can I get a comparison in terms of performance, usability, and scalability, considering future DB growth?
    What are the best practices or guidelines for implementing such a site-wide search?
  32. Rankings

    What I am really interested in is how the rankings were accomplished. I have been struggling myself with getting the document field rankings to work properly. In the article the author mentions how they created boosts for the dates but there is no code etc. to explain this. Does anybody know how this works? TIA
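
Editor's note: the article does not show how its date boosts were computed, so the following is only one plausible shape, not the author's method. It turns a document's age into a multiplier that starts at `1.0 + maxExtra` for a brand-new document and decays toward 1.0, halving the extra part every `halfLifeDays`. The formula, names, and parameter values are all assumptions.

```java
public class RecencyBoostSketch {
    /**
     * Maps document age to a boost between 1.0 and 1.0 + maxExtra.
     * New documents get the full extra boost; the extra part halves every halfLifeDays.
     */
    public static float boostFor(double ageDays, double halfLifeDays, double maxExtra) {
        // Negative ages (clock skew, future dates) are treated as age zero.
        double decay = Math.pow(0.5, Math.max(0.0, ageDays) / halfLifeDays);
        return (float) (1.0 + maxExtra * decay);
    }

    public static void main(String[] args) {
        System.out.println(boostFor(0, 30, 1.0));   // newest document, full boost
        System.out.println(boostFor(30, 30, 1.0));  // one half-life old
        System.out.println(boostFor(365, 30, 1.0)); // a year old, close to no boost
    }
}
```

A value like this would typically be applied at index time through whatever document-level or field-level boost the Lucene version in use offers, so newer posts rank higher when text relevance is otherwise similar.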
  33. Lucene performance and scalability

    How many documents are being indexed for TSS search?
    I'm thinking of using Lucene for a project with over 10 million documents.

    I'm concerned with Lucene's indexing and searching performance. Does anyone have experience using Lucene with large index files?
  34. I am interested in using Lucene for search in my application, and I am very new to all this. If I have to use a database instead of the file system, is fetching the data directly enough? I took a deployable .war file from "www.dbsight.net"; it uses queries to fetch data from the database and then builds an index from that for searching. In the end the sample war gave no output, as it had problems deploying on the Tomcat server, so I don't know what kind of query I should pass to get output.
    My requirement is that matching data should be returned whenever a similar query or word is searched for (perhaps by appending wildcards automatically, something like "select * from table where nam like "%s%"").
    If possible, please give me the exact picture so that I can benefit from this too. Many people have recommended Lucene.
    If any more clarification is required, please ask.

    Thanks
    Vijendra
  35. i need help with lucene formula

    Hi, I am a student in Rabat working on search engines, and I want to understand the Lucene formula: http://www.lucenebook.com/blog/errata/2005/01/24/ Thanks for the help; you can contact me here. Thanks.