
Doug Cutting - Founder of Lucene and Nutch

Doug Cutting, the founder of Lucene, the text search library that powers TSS and hundreds (if not thousands) of other sites, is interviewed by TSS in our latest video Tech talk. Doug covers a lot of behind-the-scenes history, challenges, and implementation details of Lucene and search in general, and also discusses Nutch, a complete open source search engine he is working on.



Doug, would you like to introduce yourself?

My name is Doug Cutting, and I am the lead developer on the Lucene and Nutch open source projects. I have been working on search technology for over 15 years now, I think. I started out at Xerox PARC doing research in information retrieval. I went from there to Apple, where I wrote a search engine that has now surfaced in the Sherlock product when you search local files. I went from Apple to Excite, where I was the primary engineer on search for four or so years. Since then I have worked for a few small companies, and I am now doing independent open source work pretty much all of the time.

Can you tell us a little bit about the history of Lucene?

Lucene is an open source full text search engine. It started out back in about '98 I believe, early '98. I took some time off while I was working at Excite and wrote it as a way to teach myself Java, and also because I thought it would be fun to write a search engine from scratch, and thirdly, I think, because I wasn't sure about my future at Excite and wanted to have some technology that I owned so I could do something independently. But then I got caught up in projects at Excite and wound up not doing anything with Lucene for a few years. Around 2000 I made it an open source project at SourceForge, and it started gathering developers and has continued to do so. In 2001, I think, it was invited to join Apache, and it has been gaining more and more use since then.

What was it like learning Java while writing Lucene?

It was fun. I liked Java from the start. I had been programming in C++ for a while, and a friend of mine once described Java as C++ on Prozac, with a lot of the nasty bits taken out. Also, writing server side software in C++ had proven to be pretty painful at Excite: a lot of memory issues, a lot of crashes. Java seemed like a much more robust way to go there and was plenty fast for the sorts of things we were doing, so I thought it was a good candidate for a search engine.

For most of us, when we look back at the code we wrote a few years ago, it does not look anything like the code we write today. How does Lucene look to you now?

It could be better, but it is not terrible. My style and Java have both evolved since then. I would use interfaces in more places, I would use final less than I did, and then of course I would use a lot of the newer class libraries, but overall I think it has aged pretty well. I had been doing object-oriented programming in one form or another for quite a while. I had also been doing search engines for quite a while, and on top of that I thought for a long time about the design of Lucene before I wrote any code, so the APIs came out pretty well thought out. There are things here and there that I would like to clean up but could not do so compatibly, but for the most part I am not unhappy.

So what were some of the challenges you faced writing Lucene?

I was obviously concerned with performance and needed to figure out which things were fast in Java and which were slow. They are actually not that different from the things that are fast and slow in other programming languages, though you may be a little more tempted towards the pitfalls in Java. Object allocation is slow, array accesses are fast, and little things like that, so I needed to make sure that I designed the algorithms so that the key parts could stay in the fast parts of the language.

When you talk about the parts of Lucene, what are the major parts of a project like that?

Well, there are about five packages within Lucene that pretty well break down Lucene's structure. There is a storage package, which has to do with how Lucene stores its data, so that you can move Lucene's data from a relational database to the file system to keeping it in memory. There is a package having to do with representing documents: what is a document that you want to have indexed and searchable? There is a package having to do with how, once you have a document that has some text, you break that text into words, because words are the unit of search used by Lucene. Once you have got words out of the text of the document, you then build a database, and so this is the index package, which builds a very specialized set of data structures designed to support searching. Then there is a search package, which holds the search algorithms. These operate over the index data structure that you built ahead of time and implement lots of different sorts of operations: find me documents that contain a word; find me documents that contain some combination of words, this word and that word but not that word; find me documents that have these two words right next to each other or close together; and combinations of these sorts of things.
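
To make the analysis, index, and search steps described above concrete, here is a minimal sketch in plain Java. It is not Lucene's API or file format: the class TinyIndex and all of its method names are hypothetical, the index lives only in memory, and the tokenizer simply lowercases and splits on anything that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index for illustration only; not Lucene's API.
class TinyIndex {
    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Analysis step: break text into lowercase words.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // Index step: record every word position; returns the new document id.
    int add(String text) {
        int docId = nextDocId++;
        List<String> words = tokenize(text);
        for (int pos = 0; pos < words.size(); pos++) {
            postings.computeIfAbsent(words.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
        return docId;
    }

    // Search: documents that contain a single word.
    TreeSet<Integer> docsWith(String word) {
        return new TreeSet<>(postings.getOrDefault(word, Collections.emptyMap()).keySet());
    }

    // Search: "this word and that word, but not that word".
    TreeSet<Integer> andNot(String must1, String must2, String mustNot) {
        TreeSet<Integer> result = docsWith(must1);
        result.retainAll(docsWith(must2));
        result.removeAll(docsWith(mustNot));
        return result;
    }

    // Search: documents where `first` is immediately followed by `second`.
    TreeSet<Integer> phrase(String first, String second) {
        TreeSet<Integer> result = new TreeSet<>();
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, Collections.emptyMap());
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> e : a.entrySet()) {
            List<Integer> secondPositions = b.get(e.getKey());
            if (secondPositions == null) continue;
            for (int pos : e.getValue()) {
                if (secondPositions.contains(pos + 1)) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }
}
```

Storing word positions alongside document ids is what makes the "right next to each other" query possible; an index that kept only document ids could answer the boolean queries but not the phrase query.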

How do you make something like Lucene as fast as it is?

I have done it a few times. I have written a few search engines, done a lot of benchmarking, looked at where they spent their time, and then rethought it. I think it helped a lot that it wasn't the first search engine I had written. At Xerox I did a few iterations of very different architectures, then did so again at Apple and again at Excite, so I have been through it a few times and knew what needed to be quick and what did not.

There are other areas of search engines, like suggestions. How do you write something like that, where it says it did not find this, but did you really mean x?

That is a lot like spellchecker technology. Lucene does not have anything like that built in, but there are open source spellcheckers that you can use. The one thing that you can do when you have an index like Lucene's is take into account the frequency of occurrence of the word, so that if you search for something that is very rare and there is something spelled very similarly that is very common, then you might suggest it. Whereas if there is something spelled very similarly that is rare, you might not suggest it. You have somewhat better criteria for spell checking than a spellchecker has because you have this frequency data, but it's essentially the same problem as spellchecking. Eventually we should probably have a module in Lucene that does that, but we don't at present.
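
The frequency idea above can be sketched as follows. This is only an illustration and not something Lucene shipped at the time: the class SuggestSketch, the docFreq map (term to number of documents containing it), and the maxEdits and minRatio thresholds are all assumptions made for the example. Similarity is measured with the standard Levenshtein edit distance.

```java
import java.util.Map;

// Hypothetical helper for illustration; names and thresholds are assumptions.
class SuggestSketch {
    // Standard Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the most frequent indexed term within maxEdits of the query,
    // provided it is at least minRatio times more frequent than the query
    // itself; returns null when no term qualifies.
    static String suggest(String query, Map<String, Integer> docFreq, int maxEdits, int minRatio) {
        long threshold = (long) Math.max(docFreq.getOrDefault(query, 0), 1) * minRatio;
        String best = null;
        int bestFreq = 0;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            String term = e.getKey();
            int freq = e.getValue();
            if (term.equals(query)) continue;                     // never suggest the query itself
            if (freq < threshold || freq <= bestFreq) continue;   // too rare, or not better than best
            if (editDistance(query, term) > maxEdits) continue;   // spelled too differently
            best = term;
            bestFreq = freq;
        }
        return best;
    }
}
```

With these rules a rare or unseen query is pulled toward a common near neighbor, while a query that is already common gets no suggestion, because a similar but rarer term fails the frequency threshold.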

You have written a lot of searching libraries and functionality, any cool uses that you have seen people use that technology for?

A lot of people use it to search perverse things. I have heard about people using Lucene to search for colors that are like other colors. People use it just as a database, because it is lighter weight and faster for a lot of applications than a relational database. That is somewhat of a perversion, because it is really designed for searching full text, but it actually does those things pretty well. I like some of the surprising ones: the Finnish military web site, Bob Dylan's web site, stuff like that is kind of fun. But there are also a lot of just high traffic Internet sites that use Lucene, and I am proud of this. Lucene has a powered by page where you can go and see all of the sites that use Lucene, or I should say all those that decided to report it. I suspect it is probably only 10% of all the sites; most people who use Lucene don't bother to tell us that they are using it.

You are known for Lucene, and you are also very well known for Nutch. Can you tell us a little bit about that?

Nutch is a new project. Lucene is fairly mature, as I said, at least by what we call mature these days, being four or five years old. Nutch is maybe a year and a half old. Where Lucene is a library that lets you add search to applications, Nutch is an application itself. It is an application for searching the web, for searching web documents, and so it has got a crawler, knows how to parse HTML, and does the various other specific things that you need to do for searching the web. Nutch is designed to scale, unlike other applications that attack this problem. Nutch is designed to scale all the way up to billions of pages and to thousands of searches a second. From the outset, that has been our design goal: to really give the open source world software like that which, say, Google runs or Yahoo runs or Microsoft will soon run. It is an ambitious challenge to pull that off, and that is where we are headed.

There are other things you have to do with web search, like ranking pages?

Yeah! There are a number of techniques specific to web search that have evolved in the not even quite a decade, I think, that we have really had web search engines. One thing that Nutch keeps is a database of all of the links, so a graph of the link structure of the web, which Nutch can use in a couple of different ways. One of the more important ways is to associate with each page the text of all of the links that point to that page. If you've got a link on a page, an anchor tag, it has got a URL associated with it, but it also has a string that is typically blue and underlined in the web page. That string is what we want to associate with the URL that it points to, and it turns out that the set of these anchor texts, as they are called, all of the text that points to a page, is extremely valuable. It is very high quality text that simply describes the content of a page. If you think about a person's home page, say, their name might occur once at the top, and they probably don't repeat it a lot. Even a company's home page may not have the company's name on it a whole lot. But the links to it are likely to be, in the case of a company, one word or maybe two word links with the company's name in them, or, if it is a person's home page, links with the person's name in them, and so it gives you a lot of data that you wouldn't otherwise have. You might get 30 links to it all like that, and there is going to be no other page with 30 links pointing to it that use that particular word. So if you've got Disney or something, you are not going to find a lot of pages that have 30 links to them named Disney that aren't a Disney page. Does that make sense?
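
A minimal sketch of the anchor text idea, under stated assumptions: the class AnchorTextSketch and its methods are hypothetical and are not Nutch's link database, the URLs used with it are placeholders, and the "score" is just a count of inbound links whose anchor text contains the query word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch for illustration; not Nutch's actual link database.
class AnchorTextSketch {
    // target URL -> lowercased anchor text of every link pointing at it
    private final Map<String, List<String>> anchorsByTarget = new HashMap<>();

    // Record one link found while parsing the page at fromUrl.
    // A full link graph would also store fromUrl; this sketch keeps only the text.
    void addLink(String fromUrl, String toUrl, String anchorText) {
        anchorsByTarget.computeIfAbsent(toUrl, k -> new ArrayList<>())
                       .add(anchorText.toLowerCase(Locale.ROOT));
    }

    // Number of inbound links to `url` whose anchor text contains `word`.
    int inlinksMentioning(String url, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String anchor : anchorsByTarget.getOrDefault(url, Collections.emptyList())) {
            if (Arrays.asList(anchor.split("\\s+")).contains(needle)) count++;
        }
        return count;
    }
}
```

The point of the count is the one made in the answer above: a page that many independent links describe with the same word is very likely about that word, even if the page itself rarely uses it.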

TSS: Oh absolutely.

Another thing that is always fascinating about search engines is that you'll search for something like "to be or not to be," expecting it to do something totally crazy, but it doesn't. It does the right thing. How do you handle circumstances like that? How does that work?

There are lots of tricky cases. There is a whole set of techniques that are used to handle those sorts of things. Typically, if people don't put them in quotes, then you say, oops, they probably didn't mean to search for those words, and you just ignore them. But if you put it in quotes and make it a phrase, then a search engine will actually try to find pages that contain that actual string. In order to accelerate that there are a few tricks. One trick, in fact a primary trick, is to note that the individual words in a phrase like that are very, very common and occur on a large portion of the pages on the web, not every page but nearly every page. Those are words like "to" or "be." But a pair like "or not," that combination of words, is used on far fewer pages. So you can pre-index these word pairs, which we call bigrams, and accelerate searches like that incredibly. You have to be careful: you can't pre-index every pair of words that occur next to each other, or else the index gets way too big, but you can do it for some very common words, and so that is one trick. There are other tricks for when you've got a very common word and a very rare word and you want to match both of those, which is also often the case in a phrase like that. Then you can skip through the data about the common word until you get to documents that also contain the rare word, and you can do some optimization around that. There are just a number of techniques, and you throw them all at the problem and you can make those searches pretty fast.
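
Both tricks described above can be sketched in a few lines of Java. This is an illustration under assumptions, not Nutch's implementation: the class PhraseTricks, the commonWords set, and the use of sorted int arrays as posting lists are all made up for the example, and the skipping is done with a plain binary search rather than any particular on-disk skip structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical sketch for illustration; not Nutch's or Lucene's code.
class PhraseTricks {
    // Trick 1: extra index terms, one per adjacent pair in which BOTH words
    // are on the very-common list. Pairs involving other words are not
    // indexed, which keeps the index from growing too large.
    static List<String> commonBigrams(List<String> words, Set<String> commonWords) {
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            String a = words.get(i);
            String b = words.get(i + 1);
            if (commonWords.contains(a) && commonWords.contains(b)) {
                bigrams.add(a + " " + b);
            }
        }
        return bigrams;
    }

    // Trick 2: intersect a short sorted list of document ids (rare word)
    // with a long one (common word), skipping ahead in the long list
    // instead of scanning every entry.
    static int[] intersect(int[] rare, int[] common) {
        int[] out = new int[rare.length];
        int n = 0;
        int from = 0; // everything before this index in `common` is already ruled out
        for (int doc : rare) {
            if (from >= common.length) break;
            int at = Arrays.binarySearch(common, from, common.length, doc);
            if (at >= 0) {
                out[n++] = doc;
                from = at + 1;
            } else {
                from = -at - 1; // insertion point: first entry greater than doc
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

A phrase query made of common words can then look up the pre-indexed bigram terms directly, and a query mixing a rare and a common word does work roughly proportional to the rare word's list rather than the common word's.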

So what inspired you to do Nutch? You say it is a young project, it seems like another Google, so why do we want another Google?

Well, it is a number of things. For one thing, for me personally it was just sort of a fun thing to do. I like to write search engines, and I think that open source software is great to have. For every mature technology there should be an open source implementation so that people can share in working on it and making it better. But I also think there is a particular need in the case of search for it to operate transparently, for people to understand why their documents are ranking the way they are. When you use a search engine to find medical and political and all sorts of information that is very important to your life, you would like to know why one page is shown to you and another isn't, or at least you should be able to find out. Maybe people won't check every time, but you want to make sure that things are on the up and up, or, if there is bias, to know what that bias is and whether it is a bias you agree or disagree with. With all the commercial search engines, the algorithms are secret, so we don't know how they work and why they rank things the way they do. That is, I think, something of a problem for the users of search engines, and an open source search engine is the way to approach it. If there is free technology out there, then people can see how it works, and in fact Nutch has a button on its search results which, for each hit, will explain exactly why it ranked the way it did, with all of the numbers that went into scoring. Folks can look at that if they want.

Do you spend your time pretty much fifty-fifty between Nutch and Lucene?

It is not anywhere near that simple. I do some contract work related to both of those projects, and I do some things on the side as well. So I'd say I probably spend roughly a third of my time on Nutch, a third on Lucene, and a third on other things. I work from home and we've got little kids in the house, so another third goes to little kids stuff. I am stretched pretty thin but manage to make it work.

So where does it go from here? Where do Nutch and Lucene go from this day?

Lucene is, as I said before, pretty mature, and I think it can take care of itself. I am still pretty involved, but I am not so worried. If I fell off a cliff or was hit by a tractor or something, Lucene would be fine. Nutch is still in a more tenuous position, so I am much more concerned about being strategic with my efforts on Nutch and trying to figure out ways that we can attract more developers and more users so that we get a critical mass. Open source projects need a certain number of people using them in order to attract developers, to get maintained, and to get valuable features added. There is a lot that still needs to be added to Nutch: things to handle more document formats, PDFs and Word documents and whatnot. There are just lots and lots of things that would make it a better product, things like suggesting alternate spellings and so on, and it takes a lot of people to do all that. So I am more focused on and more concerned about keeping Nutch going, but it is more fun in some ways to do Lucene, because there are so many people using it that it is more rewarding in an immediate sense. So it's nice to have a combination.

So what would be your top ten list of developers that you would love to get involved in these? What areas of expertise would you love to see coming in, and how do people get involved and help out?

Well, let's see. For the things I mentioned, it just takes good Java developers. It doesn't take particularly strong skills; I think somebody can come up to speed on this pretty quickly. We need to be able to have plug-ins to do format conversion from various file formats. The format converters are out there, and we just need to get the plug-ins in. It is the same again with spelling checks. There are spell check algorithms out there, and it is just a matter of integrating them into this application. It is really not rocket science. We need people to work on the quality of the search. We have got somebody who recently volunteered to help there, but it's a huge problem to do evaluation, so that you can do searches, see how the results look, and figure out how to improve the ranking algorithms and make them better. So if somebody wants to work on it, we are not going to turn them away if they are a Java developer. There are lots of people. I think what I want more than anything is companies, or even large institutions, universities and folks like that, who have a need for a search engine that can crawl their website or crawl some subset of the internet, say all the pages written in a particular language or about a particular topic, and who want to actually run that and therefore will have a vested interest in helping out the project and making sure that it works. We are not going to hire people to work on Nutch; what we want is for other people out there to hire people to work on Nutch. The best developers are people whose job it is to work on the project, and the way you make that happen is to convince people that they need to run it. So that is the avenue I pursue more than anything: besides just working on it and making it more usable, I try to encourage people to actually use it.

TSS: Okay. Well, thank you very much. I don't know if you are able to sing, but we would like you to sing a song now for us.


All Content Copyright ©2007 TheServerSide Privacy Policy