TSS interviews Doug Cutting, founder of Lucene and Nutch



  1. Doug Cutting, the founder of Lucene, the text search library that powers TSS and hundreds (if not thousands) of other sites, is interviewed by TSS in our latest video Tech Talk. Doug covers a lot of behind-the-scenes history, challenges, and implementation details of Lucene and search in general, and also discusses Nutch, a complete open source search engine he is working on.

    Watch Doug Cutting on search with Lucene and Nutch

    A lot of developers don't realize the kind of genius and time it takes to build a tool like Lucene. I think we are very lucky to have someone of Doug Cutting's talents spending all of his time working on this stuff. How many people could do this:
    Question (from the interview): How do you make something like Lucene as fast as it is?

    Answer: I do it a few times. I have written a few search engines, done a lot of benchmarking, looked at where they spent their time, and then rethought it. I think it helped a lot that it wasn't the first search engine I had written. I think at Xerox I did a few iterations of very different architectures, then did so again at Apple and then again at Excite, so I have been through it a few times and knew what needed to be quick and what did not.
  2. Nutch is pretty great

    We worked with Doug and Oregon State University to deploy Nutch (I think it was one of the first large-scale deployments of Nutch), replacing a commercial, licensed search engine. The project saved OSU over $470,000. Really great toolset.

    Jason McKerr
    The Open Source Lab
  3. Lucene is wonderful! We are using it and are more than happy.

    There is one feature we have not fully figured out yet, though. Doug mentioned it in his interview, too. It is usually known as "fuzzy search" in search terminology: searching for mistyped words by guessing what the user might have meant.

    From Doug's interview, the impression is that Lucene does not currently support it. However, there seems to be some code in Lucene for this, but we could not make sense of it yet.

    TSS is using Lucene, but from a user's experience it seems fuzzy-search support is not there.

    Anybody - any experience/solution for this?

    Thank you
  4. The magical tilde

    If you are using the QueryParser, just tack a tilde on the end of a term (fuzy~). It gets fuzzied.

    In regular code...
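    // dir is the index location (a Directory or a path String)
    IndexSearcher is = new IndexSearcher(dir);
    // match indexed terms that are similar to "value" in the field named "field"
    Query q = new FuzzyQuery(new Term("field", "value"));
    Hits h = is.search(q);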
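
    For anyone wondering what "gets fuzzied" means in practice: fuzzy matching of this kind is generally based on edit distance, i.e. how many single-character inserts, deletes, or substitutions separate the typed word from an indexed term. Below is a minimal sketch of that idea in plain Java. It is not Lucene's own code; the class and method names (FuzzySketch, editDistance, fuzzyMatches) and the maxEdits threshold are hypothetical, and it just filters an in-memory list rather than walking a real index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Lucene): illustrates edit-distance matching. */
final class FuzzySketch {

    /** Levenshtein distance: fewest single-character inserts, deletes, or substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // the row just computed becomes the previous row for the next character
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the terms that lie within maxEdits of the (possibly mistyped) query. */
    static List<String> fuzzyMatches(String query, List<String> terms, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (editDistance(query, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up term list, for illustration only.
        List<String> terms = Arrays.asList("fuzzy", "funny", "lucene", "nutch");
        System.out.println(fuzzyMatches("fuzy", terms, 1)); // prints [fuzzy]
    }
}
```

    With a threshold of one edit, "fuzy" finds "fuzzy" (one missing letter) but not "funny" (two edits away). A real engine would also rank the matches by how close they are instead of only filtering them.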
  5. And you can even use it on your desktop to google your email... or should that be nutch your email :-)

    Googling Your Email

  6. I have just started looking at Lucene.

    We need to provide searching within our web application, and the results are based on the object network. Would it be appropriate to build indexes based on the object network?

    Or is this misusing the technology?

    Currently we have an elaborate SQL-forming UI/mechanism which allows users to build a query and execute it within PL/SQL. I am investigating whether/how this could be replaced within Java.