
EJB design: Search Engines

  1. Search Engines (9 messages)

    I want to mention up front that this is about PHP and not Java, but please keep reading. In my humble opinion, the question doesn't pertain to any particular language, and I know Java Enterprise Beans a bit, so if you go off and spin an answer in J2EE terms, I'll get it without any trouble.

    I want to design a search engine that is completely reusable from project to project and gets its information from the database. I was basically thinking of this kind of architecture (not really talking about where the logic goes or what everything is); it's just an example of a site that might search news and discussion articles from the database.

    // create a generic search engine
    searchEngine = new SearchEngine();

    // this is a project specific discussion search module
    discussionSearchModule = new DiscussionSearchModule();

    // same thing but for news articles
    newsSearchModule = new NewsSearchModule();

    // add the modules to the search engine
    searchEngine.addModule( discussionSearchModule );
    searchEngine.addModule( newsSearchModule );

    // create an expression from the text typed in (which also validates it or corrects it)
    searchExpression = new SearchExpression( "Word1 AND Word2" );

    searchEngine.setSearchExpression( searchExpression );

    // get all the results from the expression in all modules
    results = searchEngine.process();

    // display results in links
    while( link = results.fetch() ) {
       link.display();
    }

    Is that a good design? You can figure out where there would be abstract classes and polymorphism. What are some good search engine architectures out there? Any articles I could read? Thanks!
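
    To make the polymorphism concrete, here is a rough Java sketch of how I imagine the abstract pieces falling out (the interface and method names are just illustrative, not from any real library):

    import java.util.ArrayList;
    import java.util.List;

    // A single search result; each module knows how to build and render its own links.
    interface Link {
        void display();
    }

    // Project-specific modules (DiscussionSearchModule, NewsSearchModule)
    // implement this; each knows its own tables and link construction.
    interface SearchModule {
        List<Link> search(SearchExpression expression);
    }

    // Wraps, validates, and possibly corrects the raw query text.
    class SearchExpression {
        private final String text;
        SearchExpression(String text) { this.text = text; }
        String getText() { return text; }
    }

    // The generic engine just fans the expression out over its modules.
    class SearchEngine {
        private final List<SearchModule> modules = new ArrayList<>();
        private SearchExpression expression;

        void addModule(SearchModule module) { modules.add(module); }
        void setSearchExpression(SearchExpression e) { this.expression = e; }

        List<Link> process() {
            List<Link> results = new ArrayList<>();
            for (SearchModule module : modules) {
                results.addAll(module.search(expression));
            }
            return results;
        }
    }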


  2. Search Engines

    Actually, querying the search engine is the easy bit. I would look at products like www.verity.com, which has a full Java API for searching.

    Verity does everything you require, you can plug databases, flat files, Lotus Notes, whatever you care to write a connector for into that architecture.

    The hard bit, and the main reason I would suggest you investigate products such as Verity, is building the collections you want to search so that they perform.

    In all the projects I have used Verity on, 90% of the time was spent on collection tuning (even when you are passing the request through to the underlying database). Only 10% of the time was spent accessing the collections, since that is already provided by the classes.

    Verity has some interesting Java features, including its distributed fail-safe architecture, which you should look at if you really want to get into doing this.

    On the downside, it's EXPENSIVE!!! If your underlying data repositories are not too big, then you can roll your own, in which case the general design you suggest looks OK.

    Your code sample looks OK, although in a production environment your search engine would either be a thread-safe singleton or a service you call, rather than being initialized by the client.
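
    For example, here is a minimal sketch of the thread-safe singleton idea, reusing the types from your pseudocode; note the expression becomes a call parameter instead of engine state, since a shared instance that holds the current expression would not be thread safe:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Sketch only: one shared engine for all clients. The expression is
    // passed per call so concurrent searches don't trample each other;
    // the module list is copy-on-write since it is read far more often
    // than it is changed.
    class SharedSearchEngine {
        private static final SharedSearchEngine INSTANCE = new SharedSearchEngine();
        private final List<SearchModule> modules = new CopyOnWriteArrayList<>();

        private SharedSearchEngine() {}

        // Eager initialization is thread safe under the JVM's class-loading rules.
        static SharedSearchEngine get() { return INSTANCE; }

        void addModule(SearchModule module) { modules.add(module); }

        List<Link> process(SearchExpression expression) {
            List<Link> results = new ArrayList<>();
            for (SearchModule module : modules) {
                results.addAll(module.search(expression));
            }
            return results;
        }
    }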

    Chz

    Tony
  3. Search Engines

    Thanks for the reply. Out of the six or so sites I posted this message on, I got about one reply. I guess talking about search engines is over everyone's heads (because they buy one instead of making one), or they don't want to give away secrets :( Well, anyhow, this code would most definitely be done on the server. That last bit about getting the results would produce some HTML, which would then go to the client, obviously. I just hoped that my design was good enough and reusable enough.

    The only thing I could think about coding was the Modules (or Collections, as you call them). Perhaps I should just call them Collections, since each is a bunch of similar data that I need to search through, and Collections is a good word for that. (I'm picky with wording when I design my classes :) I'll check out that search engine you mentioned; perhaps I can get some more features/architectural ideas from it.

    Thanks!
  4. Search Engines

    In Verity, a collection is actually a series of files on the filesystem that contain the indexed words etc. from the data source you point it at.

    This is optimized to be searched and updated by many concurrent users, but it isn't optimized for two-way queries the way a database is.

    So, it is great at:

    List every link you have that is indexed against country UK.

    It is terrible at:

    Give me a distinct list of all countries for which I have a link.
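
    To see why (a toy sketch of my own, nothing Verity-specific): a collection is essentially an inverted index, a map from term to documents, so the forward lookup is a single map hit while the reverse question has to walk every posting:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy inverted index: term -> document ids.
    class InvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        void add(int docId, String term) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }

        // Fast: "every link indexed against country UK" is a single lookup.
        Set<Integer> docsFor(String term) {
            return postings.getOrDefault(term, Collections.emptySet());
        }

        // Slow: "distinct terms for a document" must scan every posting,
        // which is exactly the two-way query the structure isn't built for.
        Set<String> termsFor(int docId) {
            Set<String> terms = new HashSet<>();
            for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
                if (e.getValue().contains(docId)) {
                    terms.add(e.getKey());
                }
            }
            return terms;
        }
    }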


    Verity, however, does not force you to use a collection. You can use various modules so that things appear as collections to the searching API but are in essence pass-throughs to things such as an RDBMS, Lotus Notes (ug), the filesystem, and anything you choose to write that conforms to their interface.

    This bit is the bit you are talking about writing, in which case your module approach is similar to theirs. Since they can search 20 million entries in under 2 seconds with 800 concurrent users, I suspect it scales rather well (OK, they did have several big servers and a lot of tuning, but they did it!).

    You are right though, search engines are hard to write and people shy away. Having done a lot of work with Verity and having a pretty solid understanding of its architecture, I think it's one of the best and definitely worth looking at for ideas.

    Hope this helps.

    Chz

    Tony
  5. Search Engines

    Great, at least now I know that I'm heading in the right direction. I just kind of made that architecture up today, thinking about it for an hour, and that was the best I could come up with. Thanks for your help, Tony. Your info and advice were great and credible.
  6. Search Engines

    Hi
    Take a look at this search engine. http://sourceforge.net/projects/lucene

    I took a completely different approach. If the search information is gathered from the database, you will have to cater for database changes, which vary from project to project.

    What I have done is use a simple HTML parser to separate the text out from the HTML and submit it to Lucene for indexing. The site is mainly read-only, with infrequent updates to the indexes, limited to one indexing run per day during off-peak times. This way, I think it is more reusable and simpler. The difficult part is generating all the URLs for indexing. There may be difficulties with Asian languages, though.
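
    For a flavour of what that looks like, here is a minimal sketch against a recent Lucene release (the index path and field names are my own choices, and older Lucene versions used a different Field API):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class PageIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/site-index")); // assumed location
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

            // One document per page: the parser-stripped text is analyzed for
            // searching, while the URL is stored verbatim so results can link back.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page.html", Field.Store.YES));
            doc.add(new TextField("contents", "text extracted by the HTML parser", Field.Store.NO));
            writer.addDocument(doc);

            writer.close();
        }
    }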

      
  7. Search Engines

    I'll take a look. I'm looking for something that can search through at least 30 tables of info, with thousands of records in each table, within a few seconds. I was thinking that each specific module would know how to construct its own links and know what methods to call (though a few methods, like findInName and findInText, would always be there).

    The way I thought about it, it would hit the database constantly, down to the millisecond; that's why I thought it might be slow. Every time someone made a search, it would construct those objects and perform the search. I do think this is slow, and I have thought up ways to keep all the keywords, the hits, and what they relate to in the database. Once you get the initial keywords from everything in the database, all you have to do is maintain them on adds/edits/deletes, which is less overhead than someone searching, constructing objects, and going through the entire database every time.
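
    A rough sketch of that keyword idea (the table name, columns, and JDBC wiring are all just assumptions): re-derive an article's keywords whenever it is added or edited, so a search only reads the small keyword table instead of scanning 30 content tables:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.HashSet;
    import java.util.Set;

    // Assumed schema: keyword_index(keyword VARCHAR, article_id INT).
    class KeywordIndexer {
        void reindexArticle(Connection con, int articleId, String text) throws Exception {
            // Drop the article's old keywords, then insert the current set.
            try (PreparedStatement del = con.prepareStatement(
                    "DELETE FROM keyword_index WHERE article_id = ?")) {
                del.setInt(1, articleId);
                del.executeUpdate();
            }
            Set<String> words = new HashSet<>();
            for (String w : text.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) words.add(w);
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO keyword_index (keyword, article_id) VALUES (?, ?)")) {
                for (String word : words) {
                    ins.setString(1, word);
                    ins.setInt(2, articleId);
                    ins.addBatch();
                }
                ins.executeBatch();
            }
        }
    }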

    >>What I have done is to use a simple HTML Parser, separate
    >>the text out from the html, and submit it to Lucene for
    >>indexing. The site is mainly readonly, infrequent updates
    >>to the indexes, limited to one indexing done per day
    >>during off-peak times. This way, I think it is more
    >>reusable and simple. The difficult part is to generate
    >>all the urls for indexing. There may be difficulties with
    >>Asian languages though.

    With my site, there will be changes every second (by many people using it). I'm predicting it's going to be a very widely used site, and the features on it pretty much guarantee constant use, since using them that way is most useful to people. Is it more reusable, though? How do you construct the links after you find the information? I only have, say, 25-30 actual pages, but all of them go to the database for content, probably making thousands of pages. How does it know the URLs for all these pages? I'm just curious.

    Thanks for the help and hope to hear a reply.

    Ken
  8. Search Engines

    That's the standard Verity approach also.

    "Don't hit the real data repository all the time, hit it once, and create collections from it. Then you just need to track changes to that datastore, and periodically update the collections."

    It also lets you define combinations, concatenations, etc., which comprise various fields in the collection (along with the free text).

    You can use any one of these fields, say a user field you create called my_hyperlink to be the link.

    Like I said, you can use the product as a pass-through, but for maximum performance you really need to build some form of index optimized for the one-way search, as I described in an earlier post.

    Architecturally, this makes sense.

    Note, however, that Verity has very little on offer to track changes to the database; you have to use triggers and "update tables" to track things, but it isn't too painful to do.
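
    To make the trigger/"update table" pattern concrete, here is a sketch (application code with assumed table and column names, not a Verity API): triggers append changed ids to a log table, and a poller re-indexes just those documents every few minutes:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    // Assumed log table: article_changes(article_id INT, changed_at TIMESTAMP),
    // populated by triggers on the real content tables.
    class CollectionUpdater {
        private Timestamp lastRun = new Timestamp(0);

        void pollOnce(Connection con) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT article_id, changed_at FROM article_changes WHERE changed_at > ?")) {
                ps.setTimestamp(1, lastRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        reindex(rs.getInt("article_id")); // refresh one document in the collection
                        Timestamp t = rs.getTimestamp("changed_at");
                        if (t.after(lastRun)) lastRun = t;
                    }
                }
            }
        }

        private void reindex(int articleId) {
            // Fetch the article and push it into the index/collection (not shown).
        }
    }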

    The last major Verity implementation I did was for a major European investment bank, for which we:

    1) Built 4 collections, containing between 50 and 150 THOUSAND documents each (fully free text indexed.)
    2) Updated the collections every 5 minutes based on newly updated data.

    We could search these collections with sub-second response times for 10,000+ registered users. All searching was done against collections, with no pass-through to the database.

    I'm not trying to dissuade you from writing this, but what you are proposing is the tip of a very big iceberg. If you look at the feature sets of search engines, they are vast. You may end up wishing you had written a cheque and bought one. :-)

    On the other hand, I still think you are on the right track, but as others have suggested, you need to avoid hitting databases frequently for the searches.

    Are they mainly parameter searches, or are there free text things as well?

    Chz

    Tony
  9. Search Engines

    That's pretty freakin' cool :) I think I'll use an index for everything I need to search through, then. I'll somehow create indexes in the database and then make those module classes contain all the concrete information about the indexes. I'll try to come up with a common interface for all situations. This way, I can use my standard model for maintainability reasons but still get the speed I need, so I don't have to go through all the articles every time. Thanks a lot, Tony!
  10. Search Engines

    Lucene works by creating indexes and searching through the indexes.

    As to frequent updates, you can do incremental indexing if you wish. As for the links, when you create the index, you can specify certain fields to be stored with it. I specifically store the title, URL, link, and a short category. For example, if I want to index this whole forum, https://www2.theserverside.com/discussion/thread.jsp?thread_id=XXXX, the thread_id will have to be generated, and I will specify the category myself as "Discussion".

    You are right about generating the URLs for indexing (e.g. https://www2.theserverside.com/discussion/thread.jsp?thread_id=XXXX); I figure this is an easier task than going through table after table for the keywords. IMHO, this is faster to implement, but of course it varies from site to site.
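
    To round that out, here is the matching search-side sketch under the same assumptions as the indexing sketch earlier (recent Lucene, the "contents" and "url" field names): the stored fields come back with each hit, which is exactly how the result links get built without touching the database:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class PageSearcher {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/site-index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            // Parse the user's query against the analyzed "contents" field.
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse("Word1 AND Word2");
            TopDocs hits = searcher.search(query, 10);

            // Stored fields travel with each hit, so the result links come
            // straight from the index; no database round trip is needed.
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("url"));
            }
            reader.close();
        }
    }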