Complex Database Searches and Huge Searches


  1. Does anyone have references/suggestions on good J2EE designs for implementing complex database searches (combinations of 30+ fields) and handling huge search results (in the millions)? I'm looking for ideas on data search, retrieval, and presentation (web).
  2. I'm using this approach:

    1. A Query Engine (stateless session EJB) component does all the work:
          public Object[][] execute(String sql, String dsrc_name, String[] columns, Object[] params) throws RemoteException;
    2. Where the sql parameter is usually a stored procedure call or a parameterized query, e.g. something like
          exec list_sys_users ?, ?, ?, ?
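
    A minimal sketch of how such an execute method might be implemented inside the stateless session bean, assuming dsrc_name is the JNDI name of a DataSource and columns names the result columns to return (the surrounding EJB plumbing, i.e. home/remote interfaces and setSessionContext, is omitted):

          import java.rmi.RemoteException;
          import java.sql.*;
          import java.util.*;
          import javax.naming.InitialContext;
          import javax.sql.DataSource;

          // Inside the QueryEngine stateless session bean (sketch only).
          public Object[][] execute(String sql, String dsrc_name,
                                    String[] columns, Object[] params)
                  throws RemoteException {
              Connection con = null;
              try {
                  DataSource ds = (DataSource) new InitialContext().lookup(dsrc_name);
                  con = ds.getConnection();
                  PreparedStatement ps = con.prepareStatement(sql);
                  for (int i = 0; i < params.length; i++) {
                      ps.setObject(i + 1, params[i]);        // bind the '?' placeholders
                  }
                  ResultSet rs = ps.executeQuery();
                  List rows = new ArrayList();
                  while (rs.next()) {
                      Object[] row = new Object[columns.length];
                      for (int c = 0; c < columns.length; c++) {
                          row[c] = rs.getObject(columns[c]); // pick out the requested columns
                      }
                      rows.add(row);
                  }
                  return (Object[][]) rows.toArray(new Object[rows.size()][]);
              } catch (Exception e) {
                  throw new RemoteException("query failed", e);
              } finally {
                  if (con != null) try { con.close(); } catch (SQLException ignore) { }
              }
          }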
  3. Since this is just reading from the database, there is no need to have an entity EJB layer in front of the database call (the Fast Lane Reader pattern).

    Instead, use the pagination pattern and get the data in pieces or pages, so each time the user clicks NEXT you fetch the next 100 records; a plain JDBC sketch follows below.

    This approach only pays off when the data you want to bring back is huge (millions of records).
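
    For illustration, here is a rough sketch of that pagination idea as a plain JDBC fast lane reader (no entity beans): a scrollable result set is positioned at the start of the requested page and at most pageSize rows are copied out. Whether the driver avoids materializing the earlier rows is driver-dependent, which is why later replies suggest database-specific SQL; the class and method names here are made up:

        import java.sql.*;
        import java.util.*;
        import javax.sql.DataSource;

        // Fast lane reader sketch: read one page of rows straight over JDBC.
        public class PagedQuery {

            private final DataSource ds;

            public PagedQuery(DataSource ds) { this.ds = ds; }

            /** Returns page pageNumber (0-based) of the query, at most pageSize rows. */
            public List fetchPage(String sql, int pageNumber, int pageSize) throws SQLException {
                Connection con = ds.getConnection();
                try {
                    Statement st = con.createStatement(
                            ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
                    ResultSet rs = st.executeQuery(sql);
                    int cols = rs.getMetaData().getColumnCount();
                    List page = new ArrayList();
                    // Position on the first row of the page; absolute() is 1-based.
                    if (rs.absolute(pageNumber * pageSize + 1)) {
                        int copied = 0;
                        do {
                            Object[] row = new Object[cols];
                            for (int c = 0; c < cols; c++) {
                                row[c] = rs.getObject(c + 1);
                            }
                            page.add(row);
                            copied++;
                        } while (copied < pageSize && rs.next());
                    }
                    return page;
                } finally {
                    con.close();
                }
            }
        }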
  4. But even with pagination, if you consider sorting, wouldn't we have to retrieve all the million records, sort them, and then take the first 100 or so? Is there a way to avoid getting all the records every time?

    Plus, we have to think about caching the results or keeping the connection open if we use paging. Any other ideas, or a better way of doing this?
  5. No, you misunderstood it. In pagination we don't retrieve all the million records; we just get the first hundred. When the user hits NEXT, you get the next 100 records from the DB, which improves performance versus getting all the million records at once.
  6. You must use database-specific SQL to avoid this. For Microsoft SQL Server, see:

    http://www.sqlteam.com/item.asp?ItemID=566

    Other database servers have similar features, for example Oracle (its ROWNUM idiom is sketched below):

    http://gethelp.devx.com/techtips/oracle_pro/10min/10min1200/10min1200.asp

    Pietari Laurila
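
    For illustration, the common Oracle idiom wraps the sorted query in ROWNUM bounds so that only the requested page leaves the database; the table and column names below are invented:

        import java.sql.*;

        // Sketch: fetch one page using Oracle's ROWNUM (table and column names are invented).
        public class OraclePaging {

            public static ResultSet fetchPage(Connection con, int pageNumber, int pageSize)
                    throws SQLException {
                String sql =
                    "SELECT * FROM ("
                  + "  SELECT q.*, ROWNUM rn FROM ("
                  + "    SELECT user_id, user_name, created FROM sys_users"
                  + "    ORDER BY user_name"
                  + "  ) q WHERE ROWNUM <= ?"   // last row of the requested page
                  + ") WHERE rn > ?";           // rows before the requested page
                PreparedStatement ps = con.prepareStatement(sql);
                ps.setInt(1, (pageNumber + 1) * pageSize);
                ps.setInt(2, pageNumber * pageSize);
                return ps.executeQuery();
            }
        }

    SQL Server 2000 has no ROWNUM, so the equivalent is usually built out of nested TOP clauses or a paging stored procedure, as in the article linked above.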
  7. > plus, we have to think of caching it or keeping the connection open,
    > if we use paging? any other ideas? or better way of doing this?

    If you really need paging, you can use ROWID / temporary tables / open cursors at the DB level; there is no need to get all the millions of records.

    But, IMHO, a filtered list, i.e. a single page with the top N records, covers 99% of cases: if the filtered record count exceeds N, the user is simply prompted to refine the search (see the sketch below).
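
    A small sketch of that "top N or refine your search" idea: ask the database for at most N+1 rows, and if the extra row shows up, tell the user to narrow the filter instead of paging (class and field names are made up):

        import java.sql.*;
        import java.util.*;

        // Sketch: single-page result with a "refine your search" flag.
        public class TopNResult {

            public final List rows = new ArrayList(); // at most maxRows entries
            public boolean truncated;                 // true => ask the user to refine

            public static TopNResult query(Connection con, String sql,
                                           Object[] params, int maxRows)
                    throws SQLException {
                PreparedStatement ps = con.prepareStatement(sql);
                ps.setMaxRows(maxRows + 1);           // one extra row acts as a sentinel
                for (int i = 0; i < params.length; i++) {
                    ps.setObject(i + 1, params[i]);
                }
                ResultSet rs = ps.executeQuery();
                int cols = rs.getMetaData().getColumnCount();
                TopNResult result = new TopNResult();
                while (rs.next()) {
                    if (result.rows.size() == maxRows) {
                        result.truncated = true;      // more than N matches
                        break;
                    }
                    Object[] row = new Object[cols];
                    for (int c = 0; c < cols; c++) {
                        row[c] = rs.getObject(c + 1);
                    }
                    result.rows.add(row);
                }
                return result;
            }
        }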
  8. > Since this is just reading from the database, there is no need to have an
    > entity EJB layer in front of the database call (the Fast Lane Reader pattern).

    An SLSB is still useful when you need:
    1. access restriction: the QueryEngine is then facaded by another session EJB (see the sketch below);
    2. to offer this query service to other EJBs.
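
    For illustration, such a facade could enforce the access restriction with standard EJB security before delegating to the QueryEngine; the role name, JNDI name, and column list below are invented:

        // Inside a facade stateless session bean (sketch only).
        // sessionContext is the SessionContext saved in setSessionContext();
        // queryEngine is a reference to the QueryEngine bean, which is not
        // exposed to clients directly.
        public Object[][] listSysUsers(Object[] params) throws RemoteException {
            if (!sessionContext.isCallerInRole("user-admin")) {
                throw new RemoteException("caller may not run this search");
            }
            return queryEngine.execute(
                    "exec list_sys_users ?, ?, ?, ?",   // from reply 2
                    "jdbc/MainDS",                      // assumed DataSource JNDI name
                    new String[] { "user_id", "user_name", "status", "created" },
                    params);
        }

    Declarative method permissions in the deployment descriptor would achieve the same restriction without code.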
  9. You could also try using IronTrack SQL to tune your code. There is a free trial available.
  10. What is the total size of the data set? How many million items?

    How many attributes of each item? 30?

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!
  11. > What is the total size of the data set? How many million items?

    A typical data set is about 5 million.

    > How many attributes of each item? 30?

    Yes, about 30 columns per row (from different tables).

    Thanks
  12. > A typical data set is about 5 million. ... about 30 columns per row (from different tables)

    So, depending on what types those fields are, you're talking about a realized object that is somewhere between 150 bytes (primitive types) and maybe a couple of kilobytes (medium-sized strings, etc.). I'll assume 1 KB; times 5 million, that gives you 5 gigabytes of objects. That's a good-sized data set.

    Our Coherence software lets you load them all into a partitioned clustered cache and index the data, letting you do distributed (parallel) queries on it. It also supports paging (i.e. a limit filter with a starting position); a rough sketch follows below. You'll probably need a cluster of at least four servers for that type of work though, just considering memory requirements.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!
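
    A rough sketch of what that looks like against the Coherence 3.x filter API; the cache name and the getStatus property queried here are assumptions, not details from this thread:

        import com.tangosol.net.CacheFactory;
        import com.tangosol.net.NamedCache;
        import com.tangosol.util.extractor.ReflectionExtractor;
        import com.tangosol.util.filter.EqualsFilter;
        import com.tangosol.util.filter.LimitFilter;
        import java.util.Set;

        // Sketch: index an attribute, run a distributed query, page through the results.
        public class CoherenceSearchSketch {
            public static void main(String[] args) {
                NamedCache cache = CacheFactory.getCache("users");   // assumed cache name

                // Index the attribute so the query runs in parallel on the cache servers.
                cache.addIndex(new ReflectionExtractor("getStatus"), false, null);

                // Distributed query, handed back 100 entries per page.
                LimitFilter page = new LimitFilter(new EqualsFilter("getStatus", "ACTIVE"), 100);

                Set firstPage = cache.entrySet(page);
                page.nextPage();
                Set secondPage = cache.entrySet(page);   // next 100 matching entries
            }
        }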
  13. We have implemented something similar to what Cameron is describing using Coherence. In our test environment we were able to load about a million 3 KB objects clustered across 4-5 cache servers. Once the data was loaded in the cache, we saw complex query times that beat our expectations and scaled under load.

    Rick