New Article on the DataListHandler Pattern Posted on TSS

  1. A new TSS article by Claudio Fratarcangeli looks at the DataListHandler pattern, which represents a substantial improvement over traditional methods for efficiently processing Internet searches and quickly displaying the results to the user. The pattern addresses the problem of processing Internet search queries that return large result sets in an efficient manner.

    Read "Data List Handler: A Pattern for Large Search Result Sets"

    Threaded Messages (30)

  2. Interesting.

    Oracle's J2EE Application Framework called "Business Components for Java" provides a ready-made implementation of this pattern.

    We call it our ViewObject, which implements our RowSet and RowSetIterator interfaces.

    http://otn.oracle.com/products/jdev/htdocs/j2ee_bc4j.html
  3. It's odd that this description focuses on Internet searches. Surely a far more common problem is paging through database result sets.

    Be that as it may, to say that this "represents a substantial improvement over traditional methods" is a bit sensationalist, since nothing in the article is much of a revelation, just good design.
  4. I read this with some interest until I got to the part that said, to paraphrase, "this pattern requires a connection to the relational store to be held open across several user requests". The author only briefly mentions that a strategy needs to be developed for expiring connections after user inactivity. I don't see a practical application here.
  5. I also found it disappointing

    I read through the problem statement and thought, yep, that's a situation I've seen. The application I work on has to display potentially large lists of items through an HTML interface and let the users page backwards and forwards through them. I was looking forward to seeing a solution to this problem. Then I read the solution.

    "Use a generalized interface to abstract away the underlying implementation details of the search and list processing logic. Provide an implementation of this interface to support efficient retrieval of large result sets from relational databases."

    Somehow this didn't strike me as a solution. Just defining an interface to hide the underlying problem does not solve the problem; the dirt is still under the rug even if you can't see it. I read through to the end, but I agree with the prior comment: holding connections open, even with many wonderful Oracle tricks, may not be the best plan. I may be being a bit harsh, but somehow I was expecting more, some solution that would make me think, yep, that's clearly the way to do it. It may just be that this is one of those problems with no solution that is completely satisfying.

    By the way, what we ended up doing was just rerunning the query each time. We use stored procedures to provide accelerators for some databases (Oracle, DB2, and MSSQL) and fall back to plain SQL if they are not present. Our thinking was that, as far as the computer is concerned, the user's think time is so long compared to the time taken to process the query that it just isn't worth producing a complex solution to the problem. Of course, our query is relatively simple and your mileage may vary. The performance tests we've run don't show the paging as being a bottleneck, so we're not going to do additional optimization at the moment.
  6. I also found it disappointing

    It's a pretty common problem. I think the only way to do it really efficiently is to cache the full search results (maybe just keys) and pass back a 'searchResultId' so state is maintained.

    Otherwise you are left with the problem of having to do the entire search for each page.

  7. I also found it disappointing

    A smart trick with Oracle is to force the database to retrieve just the page of the ordered query results that you want at a time. This is accomplished using the database's efficient support for "Top-N" queries.

    Imagine you had a query (like the sample one below, which queries our bug database for all RDBMS bugs fixed in the 8.1.6 release). If you bind the "Page Number" value to the 1st, 3rd, and 5th bind variables, and the "Page Size" (rows-per-page) value to the 2nd, 4th, and 6th bind variables, you'll get back a set of up to "Page Size" rows on the "Page Number"-th page of results. This technique can be done in a totally stateless way and minimizes the data that needs to be exchanged over the network from the database tier to your app.

    select rn, rptno, subject
    from (
     select rptno, subject, rownum as rn
      from (
       select rptno, subject
         from rpthead
        where category = 'RDBMS'
          and status > 80
          and ver_fixed_numeric = 801060000
       order by rptdate desc
      )
      where rownum <= :1 * :2
      /* Fetch top :1 (pagenum) * :2 (pagesize) rows */
    )
    /*
    ** :1,:3,:5 is "page number" and
    ** :2,:4,:6 is "page size"
    */
     where rn between (:3 - 1) * :4 + 1 and :5 * :6

    The above example uses Oracle bind variable style, but just replace the :1, :2, :3, :4, :5, :6 with six JDBC-style question marks, and you'll be in business.
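    For illustration, here is a minimal JDBC sketch of the binding pattern just described, wrapping the same query (the class and method names are hypothetical, not from the article):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TopNPager {
        // The query above, with the six bind variables as JDBC placeholders.
        private static final String PAGED_QUERY =
            "select rn, rptno, subject " +
            "from (select rptno, subject, rownum as rn " +
            "        from (select rptno, subject " +
            "                from rpthead " +
            "               where category = 'RDBMS' " +
            "                 and status > 80 " +
            "                 and ver_fixed_numeric = 801060000 " +
            "               order by rptdate desc) " +
            "       where rownum <= ? * ?) " +             // pageNum * pageSize
            "where rn between (? - 1) * ? + 1 and ? * ?";  // same pair, twice more

        /** Fetches and prints one page; the caller manages the connection. */
        public void fetchPage(Connection conn, int pageNum, int pageSize) throws Exception {
            PreparedStatement ps = conn.prepareStatement(PAGED_QUERY);
            for (int i = 0; i < 3; i++) {
                ps.setInt(2 * i + 1, pageNum);  // placeholders 1, 3, 5
                ps.setInt(2 * i + 2, pageSize); // placeholders 2, 4, 6
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getInt("rn") + ": " + rs.getString("subject"));
            }
            rs.close();
            ps.close();
        }
    }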
  8. 2 issues at stake here

    The two issues at stake here are:
    1) Preventing requeries.
    2) Minimizing data transfer over the network.

    Steve's solution solves problem #2, but it doesn't solve or alleviate #1. What alternatives exist besides loading the whole database into memory (in the extreme)?

    legolas
  9. I also found it disappointing

    Your solution is just like one I have used before.
    In fact, I think it can be implemented using a simpler SQL statement:

    select rn, rptno, subject
    from (
       select rownum as rn, rptno, subject
         from rpthead
        where category = 'RDBMS'
          and status > 80
          and ver_fixed_numeric = 801060000
       order by rptdate desc
      )
      where rn <= :pageNo * :pageSize
         and rn > (:pageNo - 1) * :pageSize

    But when using this method, there are still problems we must take into account:
    1. This query actually does a complete search and sort in the database every time it is run (see the inner select). So if the inner query is very complex and the result set that meets it is very large, the database still has to do a big sort.

    2. This SQL statement can only be executed on Oracle 8i and above. Oracle 8 and Oracle 7 don't support it.

    So, personally, I don't think this technique is very useful under high concurrent user access, large-volume databases, or complex inner queries.

    In our project, we cache several pages in the web container in order to avoid a database access every time a user goes to the next/previous page.

    Addition:
    I have read lots of patterns that describe how to do page-by-page queries, but I am very disappointed that almost all of them just give a very general interface describing WHAT WE WANT TO DO; they don't give a detailed, feasible solution for HOW WE CAN DO IT. So, essentially, they are almost useless to put into practice.

    I don't think a pattern that can't tell me how to implement it is really a pattern.

    Thanks to Steve Muench for providing a feasible solution.

    Please let me know if I am wrong.
  10. Paging Pattern

    The first part of the article gives a very compact analysis of the problem, but the text under "Solution" needs to be worked on a bit, because it is slightly "dilbert-esque" and would make readers with limited time stop reading :-)

    The solution is interesting indeed, but in the end it's only as good as the implementation of the scrollable ResultSet. When you resort to tricks you lose portability... Tough one. We also ended up with a combination of re-running the query and caching some results. Probably the scope of Internet search was chosen to show a situation in which you cannot limit the number of pages returned, and thus the upper limit on the number of results...

    At least there is some good discussion and ideas coming out of the article, and overall it is well structured and it presents the problem in a clear manner.
  11. CachedRowSets

    Can anyone comment on whether CachedRowSets may be applicable to returning paged sets of query results?

    To give a contrived example, say you were looking to query 10,000 objects in a database. Without CachedRowSets, you would need to hold on to the result set and the corresponding database connection for the duration of the user's paging through the results. However, it seems possible that CachedRowSets will allow us to disconnect from the database and have a snapshot in time of the query results for the user.

    http://www.javaworld.com/javaworld/jw-02-2001/jw-0202-cachedrow_p.html
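    For what it's worth, a minimal sketch of that disconnected-snapshot idea, using the CachedRowSet API as it was later standardized in javax.sql.rowset (the early-access class names differ; the connection URL and query here are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.sql.rowset.CachedRowSet;
    import javax.sql.rowset.RowSetProvider;

    public class SnapshotExample {
        public static CachedRowSet snapshot() throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:1521:orcl", "scott", "tiger");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "select alert_id, alert_author from gf_alert");

            CachedRowSet crs = RowSetProvider.newFactory().createCachedRowSet();
            crs.populate(rs); // copies the rows into memory

            // The connection can now be released; the snapshot lives on,
            // so the user can page through it across requests.
            rs.close();
            stmt.close();
            conn.close();
            return crs;
        }
    }

    As the next reply points out, the trade-off is that the whole result set now sits in memory.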

  12. CachedRowSets

    Using a CachedRowSet still keeps the entire result set in memory, something you want to avoid for large lookups.

    The CachedRowSet is convenient in that, for display purposes, you can avoid building a list of VOs to return to the presentation tier, so less development is required.

    - Rich
  13. I also found it disappointing

    Accessing all results in a single query may cause performance problems. If you go through the best practices published by IBM, the first thing they mention is not to make Connections instance variables: declare the Connection as a local variable in the method and close it when you are done.

    In fact, I also implemented search results using an inline view. My query looks like:

    select x.alert_id, x.alert_author
      from (select rownum rno, a.alert_id, a.alert_author
              from gf_alert a
             where alert_desc like '%<keyword>%'
             order by a.alert_id) x
     where x.rno between <index> and <index + range>

    In our project we have to display items in sets of 500; the total number of results can be 20,000-40,000. My problem is that I cannot sort the result: ORDER BY fails when you use rownum.

    Does anybody have any better ideas on how to sort the results using queries with rownum?
  14. If your central business SQL request contains an ORDER BY clause, rownum cannot be given correctly by Oracle, because the engine first gets all the data satisfying the WHERE clause, computing the rownum pseudo-column as it goes, and only executes the sort in a second step, without recalculating rownum.
    Try the central part of your SQL request to see it:
    select rownum rno, a.alert_id, a.alert_author from gf_alert a where alert_desc like '%<keyword>%'
    order by a.alert_author

    (There is no problem if by chance the sort does not change the record order, for instance if you sort on the id field.)

    A second problem is that you cannot put in the same request both the selection of the rownum field and a condition like "rownum > something" in the WHERE clause. This condition is checked as each result line is built: when the first line is built, its rownum is 1, the condition isn't satisfied, so the line is excluded; then the second line is built, but its rownum will also be 1, since the result set is still empty, and it will also be excluded... The result set is empty at the end!

    The only solution is to use three levels of nesting.
    Example: if the central business request is
    select a.alert_id,a.alert_author
    from gf_alert
    where alert_desc like '%<keyword>%'
    order by a.alert_author

    the request to extract a page should be (this works with recent versions of Oracle only):
    select rn, alert_id, alert_author
    from
    (
          select Rownum as rn, alert_id, alert_author
          from
          (
                select a.alert_id,a.alert_author
                from gf_alert
                where alert_desc like '%<keyword>%'
                order by a.alert_author
          )
          where rownum <= <index + range>   -- we can begin to limit the number of results here
    )
    where rn >= <index>
  15. Overall I think this is a great start to solving a very difficult problem. An interface approach is very nice indeed for abstracting out the behavior of numerous result-paging strategies. Two issues come to mind.

    1. How do you deal with the infamous "number of items in the collection" and other metadata-type information? I've worked on a project that just paged through the query set; however, the job required a "total number of hits". Since this is almost always DB-dependent, it would be nice to have an abstraction for this type of metadata.

    2. Most of the critique that has been posted is unjust. The goal of this pattern is to eliminate dependence on any one paging scheme. If worrying about a JDBC connection is the biggest complaint, then perhaps your query schema deserves a good review. For example, most high-volume, large-result data is typically queried from a STATIC data set. This means that maintaining a connection to the database just to hold a cursor is unnecessary, because transactional table (or row) locks are not needed. This is ideal for the query-and-cache strategy. If the data is dynamic underneath, then this pattern would not hinder your current paging scheme, but might actually help build implementation independence when you decide to change it.

    I like the getChunk() idea (see the sketch below). This provides a nice point for extra processing. If a cache miss happens, all the necessary code runs in one place and all the metadata can be updated.

    A strategy for the "open connection is a bad thing" crowd might be opening a connection, or creating a new query, and evaluating whether the dataset has changed. If it has, you can still pull back the results but notify the user that the data may be out of date. If your cache is big enough for, say, 5 pages, the user may never know they're headed for a cache miss and a potentially inconsistent query/dataset coupling.

    Noah
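    To make the getChunk() idea above concrete, here is a minimal interface sketch; every name in it is hypothetical, not taken from the article:

    import java.util.List;

    public interface DataList {
        /** Total number of items matching the query, for "total hits" displays. */
        int size();

        /**
         * Returns the items in [offset, offset + length). A caching
         * implementation checks its cache first; on a miss it re-runs or
         * scrolls the underlying query, refreshes the cache, and updates
         * any metadata, all in this one place.
         */
        List getChunk(int offset, int length);
    }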
  16. I don't think there will be a perfect solution to this problem unless the database itself provides some special construct.

    Here's how I have solved it before:
    Let's say the query is going to return 10,000 objects and we do not want the overhead of creating 10,000 objects from the rows and caching and paging through them.

    What we do is split the request into two queries.
    The first query returns only the primary keys/OIDs of all 10,000 objects, using all the conditions specified. This may take some time (depending on the query), but at least the result set returned will be small.

    The next query takes in a list of these OIDs (using the IN clause) and returns the actual objects. We can control/page the number of objects being created by using only a subset of the previously returned primary keys.

    It's sort of like creating objects only on demand, using the primary keys cached from the previous query.

    Of course, this means that some time will be required to fetch and display the first page (in case it's a UI), but subsequent fetches will be faster, as they use the primary keys.

    -Amit
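    A minimal sketch of the two-query approach described above (the table, columns, and class names are hypothetical):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class TwoPhaseFetch {

        /** Query 1: fetch only the primary keys of the full result, in order. */
        public List fetchKeys(Connection conn) throws Exception {
            List keys = new ArrayList();
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "select alert_id from gf_alert order by alert_id");
            while (rs.next()) {
                keys.add(new Integer(rs.getInt(1)));
            }
            rs.close();
            stmt.close();
            return keys;
        }

        /** Query 2: materialize one page of objects from a slice of the keys. */
        public ResultSet fetchPage(Connection conn, List keys, int offset, int pageSize)
                throws Exception {
            StringBuffer in = new StringBuffer();
            int end = Math.min(offset + pageSize, keys.size());
            for (int i = offset; i < end; i++) {
                if (in.length() > 0) in.append(", ");
                in.append(keys.get(i)); // numeric keys, so safe to inline
            }
            // Caller closes the ResultSet (and its Statement) when done.
            Statement stmt = conn.createStatement();
            return stmt.executeQuery(
                "select * from gf_alert where alert_id in (" + in + ")");
        }
    }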
  17. *** The next query takes in a list of these oids ( using the IN clause) and returns the actual objects

    The trouble with this approach is when your "pages" get too big, say tens of thousands of rows. Your IN clause is going to become enormous. I know that MS SQL Server used to crash when given a SQL statement over a certain size.

    Given that the paging problem is designed for scalability, I don't think the IN clause really cuts it.
  18. One can fix the maximum number of objects in an IN clause (say 'n') and then split larger key lists across multiple OR-ed IN clauses:

    e.g.
    ...
    AND ( objId in (A1, ..., An)
          OR objId in (B1, ..., Bn)
          OR etc.etc. )


    Could this work?
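    For illustration, a small helper that assembles such a clause might look like this (the names are hypothetical; pick maxPerIn to stay under your database's per-IN limit):

    import java.util.List;

    public class InClauseBuilder {
        /** Splits a key list into OR-ed IN groups of at most maxPerIn elements. */
        public static String build(String column, List keys, int maxPerIn) {
            if (keys.isEmpty()) return "1 = 0"; // no keys: match nothing
            StringBuffer sql = new StringBuffer("( ");
            for (int i = 0; i < keys.size(); i++) {
                if (i == 0) {
                    sql.append(column).append(" in (");
                } else if (i % maxPerIn == 0) {
                    sql.append(") or ").append(column).append(" in (");
                } else {
                    sql.append(", ");
                }
                sql.append(keys.get(i));
            }
            sql.append(") )");
            return sql.toString();
        }
    }

    For example, build("objId", keys, 3) over five keys yields "( objId in (1, 2, 3) or objId in (4, 5) )".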
  19. I disagree a lot with the solutions presented in this article, although I appreciate the effort invested by the author and his discussion of the importance of abstraction for the sake of good design.

    Here are my opinions about the article:

    1.- It's VERY bad practice to cache database resources (Connections or ResultSets) across user requests. It will consume lots of scarce resources and cause scalability problems. And if you are considering a portable design pattern (for multiple database servers), then it does not make sense to mention Oracle-specific features as a solution to very fundamental problems.

    2.- AppServer connection pooling should be considered the de facto scenario in a J2EE AppServer, for the sake of portability, so relying on driver-specific features works against the portability benefits of a J2EE app.

    3.- It should be modeled as a stateless mechanism in order to make it scalable, and thus a price must be paid in terms of efficiency, because it is hard to avoid some kind of query re-execution.

    4.- Any SQL techniques considered for this solution should be based on standard ANSI SQL, so it can be executed on any database server.

    A possible approach:

    a) Execute the query and build a bookmark list that is saved to temporary storage. This BookmarkList will contain the primary keys of the records, which is very compact even for thousands of records. With this list you can calculate the number of records and the number of pages.


    b) When a page is requested, load the subset of primary keys involved in that page and send a query like:

    select .... where...AND pkey_field in (pkey1, pkey2, pkey3...pkeyN)

    This will be very fast and stateless and, more important, scalable, portable, and compatible with the J2EE APIs for connection pooling, independent of any database-vendor-specific feature.

    Regards,
    Martin Cordova
  20. As I mentioned before, IN clauses are not scalable. If your pages are large, then the SQL statement could be very large, and could give some database systems indigestion.

    Given that we're after a generic, database-independent solution, I don't think the IN clause cuts it. I wish I had a better alternative, though.
  21. I am thinking of using the IN solution. The IN clause only contains the keys for one page. Say 10 results per page, so only 10 keys. Should that be a scalability problem?
  22. <quote>
        Given that we're after a generic, database-independent solution, I don't think the IN clause cuts it. I wish I had a better alternative, though.
    </quote>
    I think you are right: caching all the keys is too expensive from a memory point of view, plus some databases don't support large IN clauses (Oracle 7 supports only 254 parameters; not sure about Oracle 8 or 9).
    I created a bean which caches only bookmarks, i.e. if the page size is 50 records and the total row count is 50,000, it caches just 1,000 rows. There are some limitations, but for most cases it works.
  23. <Quote>
    I think you are right: caching all the keys is too expensive from a memory point of view, plus some databases don't support large IN clauses (Oracle 7 supports only 254 parameters; not sure about Oracle 8 or 9).
    </Quote>

    Why do you need a large IN clause? The IN clause should only contain the keys for one page. I don't think you want to display more than 20 records per page.
  24. Although I advocate keeping database connections open and didn't intend to get into SQL coding techniques, an approach I have used in the past, when I wasn't keeping database connections open, was to fetch only the primary keys and save them compactly in memory, with overflow to disk. When a page was to be displayed, I would insert the primary key values for that page into a temporary table and use a SQL statement like the following to fetch the data for that page:

    Select ...
    From application_table
    Where (<primary key columns>) in
      (select pkvalues from temporary_table)

    This keeps the size of the SQL statement fixed. It also saves overhead in the database, because it avoids the reparse associated with a variable-length IN clause, although there are techniques to get around this as well.

    When I say temporary table here, I mean true temporary tables that are managed automatically by the database and are session-specific. They are supported by the ANSI SQL spec, and their use is therefore vendor-independent.

    Of course, this isn't as efficient as using a large IN clause, because there is overhead in doing the insert into the temporary table every time you scroll to a new page. However, the overhead of inserting into a temporary table is less than that of a permanent table.

    I have also used the IN approach, and although it does impose an upper limit on the number of elements per page, that limit is extremely high in most major commercial vendor products. When I used the IN approach, I used bind variables for the PK values rather than embedding the PK values directly in the SQL statement, which noticeably limited the reparsing overhead and the storage for parsed SQL representations in the shared SQL cache.
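    A minimal sketch of this temporary-table variant (the DDL in the comment uses Oracle's GLOBAL TEMPORARY TABLE form, the ANSI syntax differs slightly, and all names are hypothetical):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.List;

    public class TempTablePager {
        // One-time DDL, Oracle form shown:
        //   create global temporary table page_keys (pk number)
        //   on commit preserve rows;

        /** Loads one page's keys into the temp table, then joins through them. */
        public ResultSet fetchPage(Connection conn, List pageKeys) throws Exception {
            // Clear any keys left over from the previous page.
            conn.createStatement().executeUpdate("delete from page_keys");

            // Bind variables keep the statement text fixed, avoiding reparses.
            PreparedStatement ins =
                conn.prepareStatement("insert into page_keys values (?)");
            for (int i = 0; i < pageKeys.size(); i++) {
                ins.setObject(1, pageKeys.get(i));
                ins.executeUpdate();
            }
            ins.close();

            // Fixed-size statement, as described above; caller closes the results.
            return conn.createStatement().executeQuery(
                "select * from application_table where pk in (select pk from page_keys)");
        }
    }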


  25. Steve,

    By the way, we met before at an Oracle conference where we moderated a discussion group together.

    Thanks for the link to your white paper on BC4J. It was interesting reading, and you're right about the similarities between the BC4J approach and the Data List Handler. However, there are differences in the details. For one, the DataListHandler pattern does not try to address the problem of keeping the results of queries in synch with updates done in the current transaction, as the BC4J framework does.

    The DataListHandler is designed strictly for retrieval performance. The BC4J RowSet interface seems to support a traditional iterator interface, while the DataListIterator does not. The DataListIterator interface addresses the problem of avoiding the overhead of instantiating a physical object for each logical item returned from the query. In the case of the RowSet interface, it is necessary to instantiate an object for each item returned, because you appear to cache it for update and transaction-consistency purposes. In the case of the DataListIterator, we are not concerned with caching, but simply with getting the data to the client as quickly as possible, presumably for display. To that end, the DataListIterator.get() method and the ResultSetRowMapper interface allow the client to instantiate at most a page's worth of objects to store returned items, and to reuse them while paging through the results. The interface allows an implementation whose data path from the JDBC result set to the client object is as direct as possible, while still maintaining an abstract interface.
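    The reuse idea might look roughly like this; the signature below is a guess for illustration, not the article's actual API:

    import java.sql.ResultSet;
    import java.sql.SQLException;

    public interface ResultSetRowMapper {
        /**
         * Copies the current row of the result set into an existing,
         * reusable item instead of allocating a new object per row.
         */
        void mapRow(ResultSet rs, Object reusableItem) throws SQLException;
    }

    The client would allocate one page's worth of items up front and hand them back on every scroll, so a 10,000-row result never turns into 10,000 objects.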

    All in all, BC4J is very interesting. In fact, it is clearly very similar to JDO in its approach. Of course, it addresses some issues that the JDO spec does not specifically address. I found it interesting how it keeps the data returned by queries within a transaction consistent with data previously updated but not committed in the same transaction. Very clever. However, does your View Object's where clause also somehow take into account uncommitted data in your server cache? If it didn't, and it simply was passed on to the Oracle SQL engine, then it is possible that the query could return data that it would otherwise not have returned if it had somehow taken the uncommitted updates into account. Making sure the objects returned by the query are consistent with uncommitted updates seems relatively manageable. However, making the where clause aware of uncommitted updates seems a far more difficult problem to solve. I suppose you could always apply the uncommitted changes to the database, without committing them, prior to executing the query.

  26. It is obvious from the discussion that I should have described in detail how to implement an efficient database session expiration scheme. I didn't expect that this would be the primary focus of discussion, so I only described it at a very conceptual level. Some people said that it is not practical to keep database sessions open across page requests. I would be curious to know why exactly that is so. In a moderately sized production application, I have implemented a session expiration scheme where database sessions were maintained in a static cache in the servlet tier, and it worked very well. As a follow-up to the article, I intend to describe in detail a scheme, working in conjunction with the Data List Handler relational database strategy, for expiring idle long-lived database connections in the servlet tier, where the need is most pressing when we talk about fast retrieval and scrolling of large query result sets.

    In brief, the scheme involves building a layer on top of the application server session pool API that registers open connections in a cache maintained in the ServletContext object. A background process periodically scans the cache and closes database connections that have been idle too long. The Data List Handler implementation is modified to make sure the database connection is still valid every time it gets a request for another chunk of data. If the connection is open, it simply scrolls, as it does now, to the next set of rows. If the connection is closed, then it delegates to the application-specific Data Access Object to re-execute the query, and then it positions the result set to where it was before the database connection was expired. At this point it scrolls to the next set of rows as it does now. Complexity comes into the design if we wish to handle transparent failover in a clustered application server environment, because database connection objects are not serializable and therefore cannot be failed over. However, the problem is still solvable.

    Although the scheme works in the Servlet tier, it can be adapted to the EJB tier for session beans.
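    In outline, the reaper half of such a scheme might look like the sketch below; all of the names are hypothetical, and the result-set re-positioning logic is left out:

    import java.sql.Connection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class ConnectionReaper extends Thread {
        private final Map lastUsed = new HashMap(); // Connection -> Long timestamp
        private final long maxIdleMillis;

        public ConnectionReaper(long maxIdleMillis) {
            this.maxIdleMillis = maxIdleMillis;
            setDaemon(true);
        }

        /** The Data List Handler calls this on every chunk request. */
        public synchronized void touch(Connection conn) {
            lastUsed.put(conn, new Long(System.currentTimeMillis()));
        }

        /** Lets the handler check whether its connection was reaped. */
        public synchronized boolean isOpen(Connection conn) {
            return lastUsed.containsKey(conn);
        }

        public void run() {
            while (true) {
                try { sleep(maxIdleMillis / 2); } catch (InterruptedException e) { return; }
                long now = System.currentTimeMillis();
                synchronized (this) {
                    for (Iterator it = lastUsed.entrySet().iterator(); it.hasNext();) {
                        Map.Entry e = (Map.Entry) it.next();
                        long idle = now - ((Long) e.getValue()).longValue();
                        if (idle > maxIdleMillis) {
                            try { ((Connection) e.getKey()).close(); } catch (Exception ignore) {}
                            it.remove(); // handler re-executes the query on its next request
                        }
                    }
                }
            }
        }
    }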

    The scheme does have its limitations, but I don't see why it is either inappropriate at the conceptual level or impractical. The reality is that any scheme that does not keep the result set and the associated database connection open across page requests will not perform as well as one that does. The solutions that involve fetching just the primary keys and bookmarking them still cause a big delay in displaying the first page of results. Given that it is possible to implement an efficient, portable, non-database-vendor-specific strategy for expiring long-lived sessions and making it transparent to the Data List Handler API, I can't see why one would choose any other technique. All other techniques are guaranteed to give suboptimal performance. Since I haven't actually presented the details of the expiration strategy, there may still be some skeptics. The only thing I can say to that is to reserve judgement until I have presented the details. In fact, I encourage readers to suggest their own connection caching and expiration schemes. Beyond that, we will have to rely on real-life implementation benchmarks to determine whether or not the solution is scalable.

    The ultimate solution, I believe, will come in the following areas, some of which I have already touched on in the article.

    1) The JDBC specification will be changed to support idle-connection timeouts with the ability to register a listener for an idle-connection timeout event. Application server vendors can make use of this feature to provide a ready-made connection caching and expiration scheme within their session pool functionality. This would eliminate the need for us to develop our own.

    2) The database vendors are changing (and in some cases have changed) their implementations to make the overhead of an open connection so low that they will be able to support thousands of open connections without difficulty. This will allow a database connection to go hand in hand with an application-tier session. The database session would be closed when the application-tier session is closed. This also eliminates the current need to funnel all database requests through a single fixed database identity. We could then preserve the application identity in the database tier, which would make coding database-tier logic much easier than it is now.

    As I mentioned in the article, Oracle has already made progress in this area. In the server tier, they have made the memory overhead associated with an open connection very low. In the JDBC and database interface tier, they have implemented a mechanism to multiplex multiple logical database connections over a limited number of physical connections. This is the call-level connection pooling scheme that I described in the article. It represents a significant improvement over the standard connection pooling schemes implemented in products like WebLogic.

    The changes will be both in the server tier and in the JDBC driver tier.

    Some readers objected to taking advantage of database-vendor-specific features to solve the problem. I think that refusing to embrace functionality that can only be implemented well in the database tier is completely impractical. In so doing, we reduce databases to the equivalent of a simplistic file system or a PC-based database like DBASE. We would have to make 95% of the functionality available in a database product off-limits in our solutions, and try to solve in the application tier problems that are actually database problems. This may be OK for departmental solutions, but it certainly makes it impossible to develop high-performance enterprise solutions.

    Having a strong database background along with a software engineering background, I actually have a lot to say about how the J2EE standard has fallen short in incorporating the database tier into its approach. My guess is that if the J2EE standard had been driven by a company that also sold databases, it would look very different. For one, the EJB spec would never have taken the form it did. The best solutions to problems are those that render to the application tier that which it does best, and to the database tier that which it does best.
  27. If your search results are more than a single page, then it's time to ask the user (or maybe assist the user) to refine the search criteria. How many times have you gone to a search engine, entered criteria, and then only browsed through the first page of results? Duh. We all do it... Now I'd like to see some code that really helps the user! ;-)

    VisualJeff
  28. Jeff,

      Well, sometimes (actually most of the time, for me) I find what I need on the first page that is returned. If not, then the first page clues me in on how I should refine my search. So IMHO this article and discussion are very valid.

    -Tom
  29. I would be much more interested if there were some performance figures in this article, or any comparisons between the possible solutions.
    Right now, the only thing I've got is a logical argument about data that isn't retrieved until it's necessary, and about keeping JDBC connections open...
    As far as performance is concerned, however, you cannot rely on logical arguments; there are too many degrees of freedom which you can't really assess.
    Has anyone actually compared the performance of these alternatives?
  30. In order to perform such a performance comparison, wouldn't you need a framework or use case that everyone adheres to? If it's agreed that there should be a standard way of doing searches and returning large data sets, then this design pattern is perfect and gives a way to truly evaluate each implementation. If it becomes clear that the pattern is amiss and does not provide a suitable framework, then revise the pattern.

    Seems pretty simple.

    Noah
  31. Some more food for thought...

    Distributed Result Set Iterator

    A Design Pattern for Efficient Retrieval of Large Data Sets from Remote Data Sources