Introduction to Text Indexing with Apache Jakarta Lucene



  1. Lucene, part of the Apache Jakarta project, is a Java library that adds text indexing and searching capabilities to an application and is commonly used to create searchable websites. As of November 2002, Lucene version 1.2 has been released, with version 1.3 in the works.

    Read Introduction to Text Indexing with Apache Jakarta Lucene.

    Threaded Messages (10)

  2. Lucene review

    I had to implement a keyword search feature for my company's software similar to any of the search boxes you see at the major ecommerce websites. I decided to evaluate Lucene for that purpose and I couldn't be happier with the results. It took me about a day to figure out the API, write some bridge code to populate the indexes and write more code to do the searching. It was fast and accurate in all my testing.

    There were very few caveats: it stores its indexes on the filesystem, but apparently there is now a JDBC plugin if you want to use a database.

    I strongly recommend others check it out if you need free text searching.
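
    To make "populate the indexes" and "do the searching" a bit more concrete, here is a toy inverted index in plain Java. It is only a sketch of the general idea and does not use Lucene's API; the class name ToyIndex and its add and search methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index. NOT Lucene's API: a hypothetical illustration only.
class ToyIndex {
    // term -> ids of the documents that contain it (sorted for stable results)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Split text into lowercase runs of letters and digits.
    private static String[] terms(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // "Populate the index": record which document each term appears in.
    void add(int docId, String text) {
        for (String term : terms(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // "Search": return the ids of documents that contain every query term.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : terms(query)) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms narrow it down
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "Red running shoes");
        index.add(2, "Blue running jacket");
        index.add(3, "Red jacket");
        System.out.println(index.search("red jacket"));
    }
}
```

    A real engine adds analysis, ranking and on-disk storage on top of this, which is the part a library like Lucene saves you from writing.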
  3. Lucene gets a 100%-approved ultra-cool library of the year award :-)! No kidding, it is exceptional, both from a programmer's perspective and in real-world production operations. I've used it a lot for search spaces of about 500k documents and it was very fast and easy to use.

    We are using it ourselves in medium-size projects, and we also gave the tip to one of our partner companies, who replaced their commercial search engine with Lucene. Poof! The number of Solaris multiprocessor servers necessary to run their search application decreased from 20 to 5. And this is one of the biggest search-intensive commercial websites in Germany.

    Lucene used to be unknown in the general Java community. At that stage, it was a "secret weapon". But now it's not a secret anymore. In the Java world, I think it's the number one engine. Don't leave home without it.

  4. Just wondering if Lucene supports indexing and searching of multi-byte languages, like Chinese? I looked at the documentation, but didn't find it mentioned anywhere.
  5. Lucene can support multibyte languages, but not "out of the box". An important part of Lucene is breaking down documents and search terms into tokens that can be searched on. Currently, the Tokenizers that come with Lucene work well with European languages.

    You can read more about this here.

  6. The documentation says:
    A LetterTokenizer is a tokenizer that divides text at non-letters. That's to say, it defines tokens as maximal strings of adjacent letters, as defined by java.lang.Character.isLetter() predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
    Does that mean it can't work with Chinese characters?
  7. LetterTokenizer

    It is saying that the LetterTokenizer class that comes with Lucene does not tokenize some Asian languages well. Basically, this tokenizer assumes that each run of adjacent letters is a single token. If, however, a run of adjacent characters represents more than one token (i.e. words are not separated by spaces), this class will not tokenize it correctly.
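
    A rough sketch of that behaviour in plain Java, for anyone who wants to see it. This is not Lucene's LetterTokenizer source, just the "maximal strings of adjacent letters" rule from the quoted documentation written out by hand; the class name LetterRuns is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical re-creation of the "split at non-letters" rule, not Lucene code.
class LetterRuns {
    // Return every maximal run of characters for which Character.isLetter is true.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);        // still inside a token
            } else if (current.length() > 0) {
                tokens.add(current.toString());     // a non-letter ends the token
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());         // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English: spaces and punctuation separate the words.
        System.out.println(tokenize("Hello, world 42!"));
        // Five Chinese characters written as Unicode escapes, no spaces between them.
        // Every character is a letter, so the whole run comes back as one token.
        System.out.println(tokenize("\u6211\u559c\u6b22\u641c\u7d22").size());
    }
}
```

    So the characters themselves are handled fine; the problem is that a whole phrase ends up as one token, and a search for a word inside that phrase will not match it.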

    Here is a link to a contributed library that may help you out.
    For more info, go here and search for "chinese" or "asian".
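
    One common workaround for text without word separators is to index overlapping pairs of characters (bigrams) instead of whole runs. I am not claiming the contributed library works this way; the sketch below is plain Java, the class name Bigrams is made up, and it assumes its input is a single run of letters with no spaces or punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: overlapping two-character tokens.
class Bigrams {
    // Assumes 'run' is one unbroken run of letters (no spaces or punctuation).
    static List<String> of(String run) {
        int[] cps = run.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) {
            tokens.add(new String(cps, 0, 1));      // a lone character is its own token
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));      // characters i and i+1
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Five Chinese characters written as Unicode escapes.
        System.out.println(Bigrams.of("\u6211\u559c\u6b22\u641c\u7d22"));
    }
}
```

    The query has to be split the same way, so a two-character word in the query lines up with the matching pair in the index.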

    Good luck.

  8. Does anyone know if Lucene provides any way to index Microsoft Office documents?
  9. Indexing Microsoft Word

    There is no reason you couldn't. The hardest part would be parsing the document to extract the content you want to index.

  10. Indexing Microsoft Word

    You could use the Stellent Outside In Server, formerly the Inso Filters, which is pretty common (for instance, Yahoo Mail used them to let you view Office documents as HTML last time I checked). I can't comment on cost, however, so there may be a free solution if you don't have the budget for this type of commercial product.

    (and no, I have no connection to Stellent).

  11. I would like to know what advantages Lucene has over the advanced features of Oracle 9i Text.
    My application manages huge content and assets, up to 2 TB in size now, and the db is growing very fast. We plan to use Oracle9i interMedia to store assets (word, pdf, pt, xls, jpg, eps, gif files etc.). The db stores assets for each feature film title, key art, stills etc. The content is presently in English and Spanish. The application also contains other asset sections and some modules like news, press releases etc.
    The user can do an advanced search in any of these sections. Search results should be limited based on the user's permissions (for example, news only and not press). Users should see search results in their preferred language (when content for that language exists). Users can get access to the content or the asset files (including image, video, pdf).
    For this data volume, which search should I be using, Lucene or Oracle? Can I get a comparison in terms of performance, usability, scalability and future db growth?