Iterative Data Access Pattern (Complete)


  1. Iterative Data Access Pattern (Complete) (20 messages)

    This idea is not new; it comes mostly from the CORBA domain and is an adaptation of the G.O.F. Iterator pattern. When a client calls an Entity Bean, the standard practice is to load all of the results into a collection and return it. If the result set of the query is quite large (1 000 000+ rows) then a large amount of server (and possibly network) resources will be consumed while returning the data.

    The main premise of this pattern is to store the ResultSets of queries on the server within a Stateful Session Bean and allow the client to receive the result set in nicely sized chunks.

    The queries are not permanent, though: they will be removed by the server during bean passivation. There will have to be an application-level exception that tells the client that the query is no longer available and must be recreated.

    Firstly we have a repository object responsible for holding the queries and generating Data Objects from the underlying ResultSets.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public abstract class DOFactory {

      protected ResultSet rs;

      public DOFactory(ResultSet rs) {
        this.rs = rs;
      }

      /**
       * Returns the next Data Object, or null when the result set is exhausted.
       */
      public abstract Object next();

      /**
       * Do general clean up. May want to hold the actual statement/connection in the factory.
       */
      public void close() {
        try {
          Statement stmt = rs.getStatement();
          Connection con = stmt.getConnection();
          rs.close();
          stmt.close();
          con.close();
        } catch (SQLException e) {
          // Log and swallow; there is nothing sensible to do on clean-up failure.
        }
      }
    }
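
    The concrete subclass used later in the session bean (AccountDOFactory) is not shown in the original post; a minimal sketch, assuming a hypothetical AccountDO value object with an id and a name column, might look like this:

    public class AccountDOFactory extends DOFactory {

      public AccountDOFactory(ResultSet rs) {
        super(rs);
      }

      /**
       * Builds one AccountDO per row, returning null once the result set is exhausted.
       */
      public Object next() {
        try {
          if (!rs.next()) {
            return null;
          }
          // Column names are illustrative only.
          return new AccountDO(rs.getLong("account_id"), rs.getString("name"));
        } catch (SQLException e) {
          throw new RuntimeException("Unable to read next account row", e);
        }
      }
    }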

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Hashtable;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public class QueryRegistry extends Hashtable {

      long nextQuery = 0;

      public Long registerQuery(DOFactory factory) {
        Long queryContextId = new Long(nextQuery++);
        put(queryContextId, factory);
        return queryContextId;
      }

      public Collection next(Long queryContextId, int n) {
        Collection c = new ArrayList();
        DOFactory factory = (DOFactory) get(queryContextId);
        if (factory == null) {
          // Throw an application-level exception so that the client knows it has
          // to recreate the query. NoSuchElementException is used here as a
          // stand-in; a custom checked exception would normally be preferable.
          throw new NoSuchElementException("Query " + queryContextId + " is no longer available");
        }
        int i = 0;
        for (; i < n; i++) {
          Object dataObj = factory.next();
          if (dataObj == null) { break; }
          c.add(dataObj);
        }
        // Clean up if we have reached the end of the result set.
        if (i < n) {
          factory.close();
          remove(queryContextId);
        }
        return c;
      }

      public void close(Long queryContextId) {
        DOFactory factory = (DOFactory) remove(queryContextId);
        if (factory != null) {
          factory.close();
        }
      }

      public void closeAll() {
        // Close all open queries.
        for (Iterator it = values().iterator(); it.hasNext();) {
          ((DOFactory) it.next()).close();
        }
        clear();
      }
    }


    Next we have to integrate this into a Stateful Session Bean to make use of it. Each SFSB requires 2 additional delegate methods: next() and close().

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Collection;
    import javax.ejb.SessionBean;
    import javax.ejb.SessionContext;

    /**
     * @ejb.bean type="Stateful"
     */
    public class AccountQueryBean implements SessionBean {

      private transient QueryRegistry registry = null;

      public void ejbCreate() {
        registry = new QueryRegistry();
      }

      public Long findAll() throws SQLException {
        // getConnection() is a private helper that obtains a Connection from a DataSource.
        Connection con = getConnection();
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT * FROM ....");
        AccountDOFactory f = new AccountDOFactory(rs);
        return registry.registerQuery(f);
      }

      public Collection next(Long queryContextId, int n) {
        return registry.next(queryContextId, n);
      }

      public void close(Long queryContextId) {
        registry.close(queryContextId);
      }

      public void ejbPassivate() {
        // JDBC objects do not survive passivation, so close everything now.
        registry.closeAll();
      }

      public void ejbActivate() {
        registry = new QueryRegistry();
      }

      public void ejbRemove() {
        registry.closeAll();
      }

      public void setSessionContext(SessionContext ctx) { }
    }
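
    A rough sketch of how a remote client might drive these three methods; the home/remote interface names (AccountQueryHome, AccountQuery) and the process() call are illustrative, not part of the original post:

    // Client-side paging loop. RemoteException/CreateException handling omitted.
    AccountQuery query = accountQueryHome.create();
    Long contextId = query.findAll();
    Collection page;
    do {
      page = query.next(contextId, 100);   // fetch 100 rows per remote call
      process(page);                       // illustrative application callback
    } while (page.size() == 100);          // a short (or empty) page means we are done
    query.close(contextId);                // harmless if the query already auto-closed
    // If the bean was passivated in between, next() signals (via an application
    // exception) that the query must be recreated with findAll().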


    Drawbacks: This pattern provides extra flexibility for the client application, but in doing so it removes control of resources (e.g. database connections) from the server, so poorly written client code could have an adverse effect on the server.

    Possible extensions: Prefetch caching of results. Timing out of individual queries rather than relying on Bean passivation.
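
    As a sketch of the query time-out extension (illustrative only; it assumes QueryRegistry is extended to record a last-access time per query, with registerQuery() and next() calling touch()):

    // Additional members inside QueryRegistry.
    private final Hashtable lastAccess = new Hashtable();   // queryContextId -> Long timestamp

    private void touch(Long queryContextId) {
      lastAccess.put(queryContextId, new Long(System.currentTimeMillis()));
    }

    /** Culls any query that has been idle for longer than maxIdleMillis. */
    public void closeExpired(long maxIdleMillis) {
      long now = System.currentTimeMillis();
      for (Iterator it = lastAccess.entrySet().iterator(); it.hasNext();) {
        Map.Entry e = (Map.Entry) it.next();
        long last = ((Long) e.getValue()).longValue();
        if (now - last > maxIdleMillis) {
          close((Long) e.getKey());   // closes the cursor and removes the factory
          it.remove();
        }
      }
    }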

    Threaded Messages (20)

  2. Iterative Data Access Pattern (Complete)

    This is already a pattern. Value List Handler or Page by Page Iterator.

    Also, holding onto a database connection in this fashion will not scale very well. I've found using Oracle's ROWNUM or PostgreSQL's LIMIT/OFFSET features to be a better approach. There's actually a section on this in _Bitter EJB_.
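
    As a rough illustration of the LIMIT/OFFSET style (PostgreSQL syntax; the accounts table, AccountDO value object and getConnection() helper are made up for the example):

    // Fetch one page at a time; each request opens and closes its own connection.
    public Collection findAccountsPage(int pageSize, int offset) throws SQLException {
      Connection con = getConnection();
      try {
        PreparedStatement ps = con.prepareStatement(
            "SELECT account_id, name FROM accounts ORDER BY name LIMIT ? OFFSET ?");
        ps.setInt(1, pageSize);
        ps.setInt(2, offset);
        ResultSet rs = ps.executeQuery();
        Collection page = new ArrayList();
        while (rs.next()) {
          page.add(new AccountDO(rs.getLong("account_id"), rs.getString("name")));
        }
        rs.close();
        ps.close();
        return page;
      } finally {
        con.close();
      }
    }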
  3. Iterative Data Access Pattern (Complete)

    The Value List pattern in "Core J2EE Patterns" suffers from the exact problem that I am trying to avoid: loading all of the data into memory on the server before returning results to the client.

    Scalability issues (due to large numbers of open database cursors) can be addressed at least partially by implementing query timeouts and being a little bit ruthless about how long a query can "live" for (poorly written clients can still cause issues, though). If the client is well behaved (closes its queries after use) then the cursors will not stay open much longer than if all of the data was read at once. It all depends on the nature of the application. Systems with large numbers of small queries will not perform well with this pattern.

    This also allows one thread of a program to process results while another waits on getting the next chunk of data from the database, which makes the whole application more responsive.

    The main goal is to allow the client to make some decisions about resource consumption. If the client is happy using lots of memory then a sufficiently large value of 'n' will ensure that the result set is closed automatically because it will be exhausted (note the i < n check in the next() method). However, it will still require two remote calls for a single query (one to create it and one to fetch it), so again there is a trade-off between efficiency and flexibility.

    The ROWNUM and LIMIT/OFFSET options are good ones, but your SQL will then be database specific (although this may not be a major issue depending on the implementation; occasionally optimising SQL for a specific database is necessary in order to get reasonable performance).
  4. Problems

    Oracle's ROWNUM is not a recommended approach, as the solution ties the implementation to a specific database. Plus, if the user wants to see a SORTED result on the front end, sorting becomes a cumbersome process.

    Leaving the database connections open is also a problem.

    BTW, executing the SELECT statement directly in the bean is not a good approach. Instead we can have DAOs doing this.

    Wouldn't it be nice if we maintained the QueryRegistry from the DAO...
  5. It is possible to implement paging in the database, i.e. you do not need to return huge result sets to the EJB layer.

    The approach is to run the query and insert the results into a Temporary Result Table on the database, along with a query id and an individual row number for each row. The query run by your DAO then passes in the query id, the current position in the result set and the number of rows required (page size), and selects a single page of rows from the Temporary Result Table.

    In this way the app server never needs to hold more than one page of data at a time. In my own particular implementation we have huge result sets which need aggregating in the EJB layer and rendering to CSV or PDF format. This approach means that the overall performance is slightly reduced, but the amount of memory required by the EJB container to hold the result set is greatly reduced.
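
    A minimal sketch of the two steps described above, assuming a hypothetical temp_result table keyed by (query_id, row_num); the row-numbering syntax varies by database and would need adapting:

    // Step 1: materialise the full query once, numbering each row.
    void materialise(Connection con, long queryId) throws SQLException {
      PreparedStatement ps = con.prepareStatement(
          "INSERT INTO temp_result (query_id, row_num, account_id, name) " +
          "SELECT ?, ROW_NUMBER() OVER (ORDER BY name), account_id, name FROM accounts");
      ps.setLong(1, queryId);
      ps.executeUpdate();
      ps.close();
    }

    // Step 2: each page request selects one slice of the pre-numbered rows.
    Collection fetchPage(Connection con, long queryId, int position, int pageSize)
        throws SQLException {
      PreparedStatement ps = con.prepareStatement(
          "SELECT account_id, name FROM temp_result " +
          "WHERE query_id = ? AND row_num BETWEEN ? AND ? ORDER BY row_num");
      ps.setLong(1, queryId);
      ps.setInt(2, position);
      ps.setInt(3, position + pageSize - 1);
      ResultSet rs = ps.executeQuery();
      Collection page = new ArrayList();
      while (rs.next()) {
        page.add(new AccountDO(rs.getLong("account_id"), rs.getString("name")));
      }
      rs.close();
      ps.close();
      return page;
    }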
  6. Anybody care to comment on this Page-by-Page Iterator variation, which uses EJB QL and CMR instead of JDBC? I haven't used it or verified it in any way.

    Not so tasty looking page..
    http://citeseer.nj.nec.com/544377.html

    ..then at the top right there is a PDF in the 'view or download' section which has the explanation and source code.
  7. My few cents

    Guys,

    It is my true belief that the pattern described here is a very good approach for general-purpose iteration over a result set. Now, I am not advocating that keeping database resources open is a good strategy, but with some smart usage of the tools at hand we can make the pattern function properly, in a scalable fashion, while still freeing up database resources.

    Querying approach.

    What if we defined an ordering in our selects, based on the supplied criteria, or even better, based on keys in the table(s)?

    Then we can actually query the database as often as needed to retrieve the following/previous page of information, according to the ordering we had in mind.

    As an example, consider that we live in a country where everyone has a unique first name (yeah, right). We can build an iteration approach as follows.

    1. First query
    select * from tbl_people
    order by ppl_fname

    2. We then return a page of people to the client, along with a key into this result set; the key contains the last ppl_fname encountered in the page (and, for paging backwards, the first ppl_fname encountered in the page).
    (note that we have to handle boundary cases of start-of-resultset and end-of-resultset)

    3. If the next page is desired, the client can invoke
       PeopleService.nextSetOfPeople
    The query underlying this invocation would be derived from the key passed into nextSetOfPeople:

    select * from tbl_people
    where ppl_fname > 'gandolf'
    order by ppl_fname;

    Note how the above query hits the database again, but efficiently jumps to the right place in the result set.

    This approach is very scalable, and cross-database portable (no ROWNUM).

    In addition, we could choose to cache the VOs returned by the service in multiples of the page size, and provide a partial view into the complete result set.

    This could be part of the data structure that is stored by the stateful bean for a client, which will still probably be more efficient than holding on to JDBC resources.
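
    A hedged JDBC sketch of the key-based ("seek") query described above; PersonDO and getConnection() are illustrative names, not part of the original post:

    // Fetch the page that follows the given last-seen first name.
    // Passing null fetches the first page.
    public List nextSetOfPeople(String lastSeenFname, int pageSize) throws SQLException {
      Connection con = getConnection();
      try {
        String sql = (lastSeenFname == null)
            ? "SELECT * FROM tbl_people ORDER BY ppl_fname"
            : "SELECT * FROM tbl_people WHERE ppl_fname > ? ORDER BY ppl_fname";
        PreparedStatement ps = con.prepareStatement(sql);
        if (lastSeenFname != null) {
          ps.setString(1, lastSeenFname);
        }
        // setMaxRows keeps the driver from fetching more rows than one page needs.
        ps.setMaxRows(pageSize);
        ResultSet rs = ps.executeQuery();
        List page = new ArrayList();
        while (rs.next()) {
          page.add(new PersonDO(rs.getString("ppl_fname")));
        }
        rs.close();
        ps.close();
        return page;
      } finally {
        con.close();
      }
    }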

    I have used the stateless approach on numerous occasions, and it has worked out very well, both in terms of performance and scalability.

    I will go with the stateless approach over the stateful one whenever possible.

    Hope my scribble here makes sense, Mikhail.
  8. I had such a problem in a project and finally we implemented a compromise that allows getting a subset on the server (e.g. 300 to 1000 rows among the 1 000 000). The client can play with this subset using page up, page down until he needs rows that are not stored on the server. Using a scrollable cursor (closed after each access) and maintaining a "pagination state", one can provide a new subset (that can partially overlap the previous one, to avoid a new scrollable query if the client asks for page up) and so on. This technique has several advantages:

        - control of the server memory
        - does not maintain a connection with the database
        - smooths out the number of real accesses to the database over time
        - except for real accesses to the database, the client feels that accesses are
          fast
  9. Sorry, this is the complete message:


    I had such a problem in a project and finally we implemented a compromise that allows getting a subset on the server (e.g. 300 to 1000 rows among the 1 000 000). The client can play with this subset using page up, page down until he needs rows that are not stored on the server. Using a scrollable cursor (closed after each access) and maintaining a "pagination state", one can provide a new subset (that can partially overlap the previous one, to avoid a new scrollable query if the client asks for page up) and so on. This technique has several advantages:

        - control of the server memory
        - control of the network data flow
        - does not maintain a connection with the database (result sets are closed
          after each query to the DBMS)
        - smooths out the frequency of accesses to the database
        - except for real accesses to the database, the clients feel that their
          queries are fast
        - allows dealing with queries that have a complicated ORDER BY clause (in such
          cases you cannot locate the last row with a WHERE clause)

        Some drawbacks:
        - your application is limited to databases that support scrollable cursors
          (a forward-only cursor is not enough)
        - as you should close result sets after each access (in order to release
          memory in the temporary table used by the DBMS to support scrolling),
          you must re-execute your query each time you need a new subset from the
          database
        - maybe some others...

        With some basic patterns such as Factory and Strategy, you can build
    a small framework to deal with large amounts of data.
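
    A rough sketch of the scrollable-cursor subset fetch described above (standard JDBC 2.0 API; the accounts table, AccountDO value object and getConnection() helper are illustrative):

    // Re-executes the query for each subset, jumps to the requested offset with a
    // scrollable cursor, reads one subset, then releases every JDBC resource.
    public List fetchSubset(int startRow, int subsetSize) throws SQLException {
      Connection con = getConnection();
      try {
        Statement stmt = con.createStatement(
            ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
        ResultSet rs = stmt.executeQuery(
            "SELECT account_id, name FROM accounts ORDER BY name");
        List subset = new ArrayList();
        // absolute() positions the cursor on the given row (1-based) and returns
        // false if that row does not exist.
        if (rs.absolute(startRow)) {
          do {
            subset.add(new AccountDO(rs.getLong("account_id"), rs.getString("name")));
          } while (subset.size() < subsetSize && rs.next());
        }
        rs.close();
        stmt.close();
        return subset;
      } finally {
        con.close();
      }
    }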

    VM
  10. This approach to pagination has two serious problems that make it useless, if not dangerous:

    1. Leaving JDBC objects hanging around is one of the worst practices. This may well work for a single user, but when you have tens and hundreds of remote clients opening their browsers and never clicking "Logout", you'll have double and triple the number of open Statements, Connections and ResultSets with no chance of them being reclaimed. This can exhaust resources both on the middle tier and on the database side very quickly.

    2. JDBC objects don't survive SFSB passivation.

    Basically, it's not going to fly, at all.
  11. Two things...

    1. As someone else already mentioned, this is already a pattern--the Page-by-Page Iterator (or the more recent nomenclature of Value List Handler, both according to Sun Core J2EE Patterns). It's a pattern that's certainly not unique to Java/J2EE apps, but is very common to any app that has "large" result sets (more entries than can be displayed on one page). Think Google search engine results: do you want Google to re-execute the search for each succeeding page and subset only the ones for the particular page number you're asking for, or...

    2. Noting the many qualms regarding holding onto a JDBC connection during this process, there is a much better way. JDBC 2.0 gave us the RowSet, effectively a disconnected implementation of the ResultSet. If you fill a ResultSet with the results of a JDBC query, and populate a disconnected RowSet (e.g., the CachedRowSet or WebRowSet implementation) with those results, you can free up the database connection immediately. You have, in essence, a CACHED RESULT SET from your query. You can then extract subsets of those results and display them accordingly. Since RowSet implementations should be serializable, they need not hog real memory. If you play it smart, you can wrap a RowSet as a "tabular" object (essentially implemented as a List of Maps), and use the various powerful capabilities of the List interface to "subList" its contents as needed on a page-by-page basis. Alternatively, if you find the ResultSet/RowSet API limiting, directly build a real tabular object (not a wrapper) for this purpose (though this may be more resource-consuming for really large result sets).
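
    For illustration, a disconnected CachedRowSet can be filled from a ResultSet and the connection released immediately. The sketch below uses the javax.sql.rowset API (standardised in later JDKs via RowSetProvider; containers of the era shipped vendor implementations such as com.sun.rowset.CachedRowSetImpl instead), plus an assumed getConnection() helper:

    public CachedRowSet loadAccounts() throws SQLException {
      Connection con = getConnection();
      try {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT account_id, name FROM accounts");
        CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
        cached.populate(rs);                         // copies the rows into memory
        rs.close();
        stmt.close();
        return cached;                               // fully disconnected from here on
      } finally {
        con.close();
      }
    }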

    It always disturbed me that we have the Collections API, which is really powerful but underutilized, yet JDBC uses its own more complex API. Seems to me we can leverage Collections for database query results (and results from other kinds of queries as well) to make a generalized tabular access methodology for such results.

    Just a thought...
  12. Two things...

    The problem with CachedRowSets is that they are not suitable for large sets of rows, e.g. 300 000+ rows; populating one from a ResultSet usually ends in an OutOfMemoryError (this is true even if you give the VM a large initial and maximum heap).
  13. Two things...

    I have the impression that if we have to manipulate 300 000+ rows, there are mainly two cases:

    1) We are sending read-only data to a client

    In this case, I don't know many end users who would read all 300 000 rows! Perhaps we can limit the result to a few hundred rows cached server-side (and so release the connection, avoid the 'out of memory' problem, ...).
    Of course we also need to warn the client that his request was too wide and that he should refine his criteria.


    2) We are doing some server-side/batch processing

    Then we had better design our processing to work row by row, so that we only instantiate the context related to one row at a time.
  14. Two things...

    I'm not sure I understand what's being suggested as an alternative here. If CachedRowSets (or other disconnected containers as well?) are unsuitable for very large result sets, what IS suitable? Leaving the Connection open and holding on to the ResultSet, I thought we all agreed, is NOT a viable option for a variety of significant reasons. Therefore... the data has to go somewhere once the Connection is severed. A CachedRowSet is one choice, but it isn't the be-all and end-all, for sure. In fact, the container I suggested, the "tabular" object, seems a better fit because it is generalized--it could be used not only for JDBC query results but as a generalized container for ANY kind of query results (LDAP, search engine, etc.)

    If our argument is about the limitations of a particular implementation of this sort of disconnected result-set container, fine. Sun probably never meant the CachedRowSet and WebRowSet implementations to be anything more than an initial example. But what do we want in a better-designed alternative? Personally, I like the idea of getting away from the ResultSet API entirely, since it's limiting and confines us to RDBMS semantics. My suggestion is a container that is a List of Map objects, where each Map object represents a row. The List API opens the door to the very kinds of sublisting, sorting, and filtering mechanisms we would want to solve the problems the original author of this thread is talking about.
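
    A minimal sketch of the "tabular" container idea, converting a ResultSet into a List of Maps (one Map per row, keyed by column label); this is an illustration of the suggestion above, not an existing library:

    // Copies every row into memory as a Map keyed by column label, so the JDBC
    // resources can be closed immediately afterwards (uses java.sql.* and java.util.*).
    public static List toTabular(ResultSet rs) throws SQLException {
      ResultSetMetaData meta = rs.getMetaData();
      int columns = meta.getColumnCount();
      List rows = new ArrayList();
      while (rs.next()) {
        Map row = new LinkedHashMap();               // preserves column order
        for (int i = 1; i <= columns; i++) {
          row.put(meta.getColumnLabel(i), rs.getObject(i));
        }
        rows.add(row);
      }
      return rows;
    }

    // Paging then becomes a simple subList() on the cached rows, e.g.
    //   List page = rows.subList(offset, Math.min(offset + pageSize, rows.size()));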

    But regardless of whether we go down that road or not, I think we can agree that some sort of disconnected container is required--maintaining a JDBC Connection across requests to hold on to the contents of a ResultSet is simply not good practice.
  15. I Concede

    Although I am not 100% convinced that the pattern is entirely bad, as it works well in our domain (small number of clients, few large queries), majority rules and I agree it is not good in the general case.

    Here is an alternative, which in reality is just a slight enhancement to the Value List iterator.

    Basically, execute the query but only retrieve the primary keys. Cache these in a list on the server. Requery for the objects that match the next n keys in the list. Conserve resources by removing the used keys, effectively making it forward-only. Assuming you have an efficient primary key (e.g. Long), resource consumption will be fairly reasonable (1 000 000 rows will equate to around 12MB, a couple of orders of magnitude less than holding the full DOs).
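
    A hedged sketch of this key-list variant inside a stateful session bean; the bean name, AccountDAO and its two methods are illustrative, not part of the original post:

    public class AccountKeyListBean implements SessionBean {

      // Survives passivation because it is just a serializable list of keys.
      private LinkedList keys = new LinkedList();

      /** Runs the query once, caching only the primary keys. */
      public void findAll() throws SQLException {
        // e.g. "SELECT account_id FROM accounts ORDER BY name"
        keys = new LinkedList(new AccountDAO().selectAllAccountIds());
      }

      /** Returns the next n Data Objects and forgets their keys (forward-only). */
      public Collection next(int n) throws SQLException {
        List batch = new ArrayList();
        while (batch.size() < n && !keys.isEmpty()) {
          batch.add(keys.removeFirst());
        }
        // Requery only for this batch, e.g. "... WHERE account_id IN (?, ?, ...)".
        return new AccountDAO().selectAccountsByIds(batch);
      }

      // Remaining SessionBean lifecycle methods reduced to stubs.
      public void ejbCreate() { }
      public void ejbRemove() { }
      public void ejbActivate() { }
      public void ejbPassivate() { }
      public void setSessionContext(SessionContext ctx) { }
    }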

    This will meet my basic requirements.

    1) A level of data consistency: the query will not be adversely affected by inserts increasing the size of the data set (imagine a batch-based process that is selecting using ROWNUM more slowly than inserts are occurring). Updates we just have to grin and bear.
    2) RDBMS agnostic (again a downfall of ROWNUM).
    3) Still allows the client to control what is sent across the wire and to a certain extent the level of resource consumption on the server.

    The query context id stuff could optionally be used if you wanted to use the same SFSB for multiple queries.

    And yes I do realise this solution is kind of obvious, and is probably available in literature that I have not yet read.

    I still feel a little obligated to defend a couple of points:

    > ...you'll have doubled and tripled amount of open Statements,
    > Connections and ResultSets without any chance to be reclaimed

    Untrue; all open connections for a SFSB are reclaimed when the bean is passivated:

    public void ejbPassivate() { registry.closeAll(); }

    Also you can implement a timeout and/or a max number of queries and cull open queries in either or both situations.

    > 2. JDBC objects don't survive SFSB passivation.

    That's probably why I make the registry object transient and close all of the connections when the SFSB is passivated.

    > ...we have the Collections API, which is really powerful but
    > underutilized, yet JDBC uses its own more complex API.

    AFAIK, the JDBC API is not its "own", but is based on a standardised API. I can't recall the name of it, but I believe the ODBC API on the MS side is very similar in structure. You have to be wary of using an API that doesn't contain semantics that map to a DB cursor, otherwise you will run into other scalability issues, very similar to the ones that I was trying to avoid with this pattern. A couple of JDO implementations that I have looked at suffer in this manner, and some have subtle solutions to it, e.g. Castor's lazy loading.
  16. I Concede

    I see that it has already been mentioned.

    Message #45624.
  17. No need to concede

    First of all, where is message #45624? I'm not sure how one gets to a particular message number without knowing the thread here on TheServerSide.com.

    Also, you propose the idea of running a query to get JUST the primary keys (however many there are in total), caching those results, and then running queries for each "batch" or "subset" using the primary key as a WHERE clause qualifier to get the complete rows for the range you desire. It's an interesting alternative, but I venture that it would be an inefficient use of database resources, since you are connecting and executing a larger number of queries in the long run. (Though not necessarily so, since very often searches or other web queries producing large result sets are only browsed through the first few pages, so with your alternative you're only getting the results you need when you really need them. So... that's something to think about.)

    Finally, regarding your comments about my contrast between the Collections API and the JDBC API: agreed, JDBC's "own" API draws its pedigree from older more established APIs like ODBC (and some native database APIs), but that's the point. In my opinion, doing it "the way it's always been done" may get you closer to the bits and bytes of what's actually happening, and that has some value to some, but as a high-level abstraction it is awkward and clumsy. My thought is that maybe we have something better in abstracting database query results as a List of Map objects, especially since you can use the same abstractions for ANY query, search, or tabular data request (e.g., LDAP, e-mail/messaging systems via POP/IMAP/NNTP, search engine). Imagine if today you have a database as your data source, but tomorrow you're going against an LDAP directory. Would you rather convert all your application-specific code from JDBC to LDAP (or even JNDI), or would you rather let lower-level common code do the dirty work of getting your data for you and formatting the results in a single uniform format, so that you don't have to change your application-specific code because your data source type has changed?

    THAT'S my particular enchantment with a uniform "tabular" data format.
  18. You've all talked about solutions using JDBC 2, but what about when you are using CMP? The only thing I've been thinking of so far is to hold a number of objects in a collection and retrieve the data with some programming techniques, but as you have all said, holding a large collection would end in a memory overflow or something like that.. What can we do?
  19. You could use ejbSelect() to get only the primary keys and keep them in a SFSB. When the client asks for, say, the next 50 rows, get the value objects, put them in a Collection and return it.

    Though I can't use this because I need to be able to sort the results without firing off the search again. I need the full data at hand.
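
    For illustration, a CMP 2.x entity bean can expose just the primary keys through an ejbSelect method (the Account bean, its accountId field and the home method name below are made up; the EJB QL lives in ejb-jar.xml):

    // Declared on the abstract CMP entity bean class; the container implements it
    // from the EJB QL in the deployment descriptor, for example:
    //   <ejb-ql>SELECT a.accountId FROM Account a</ejb-ql>
    public abstract Collection ejbSelectAllAccountIds() throws FinderException;

    // ejbSelect methods are only callable inside the bean, so expose the result
    // through a home method that the stateful session bean can call:
    public Collection ejbHomeGetAllAccountIds() throws FinderException {
      return ejbSelectAllAccountIds();
    }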
  20. Nothing comes without cost

    There is no miracle cure to the problem of large result sets. We cannot "make a wish" and create a reality in which a result set with >300,000 rows somehow is compressed to only take up 7 1/2 bytes of memory. I don't think even Java 1.5 can do that. :-)

    So, if you want ALL the data, you have to get ALL the data, and if you want to go through it all, you have to keep it all somewhere. The question becomes how you optimize. The suggestion of getting all the primary keys only, then getting the full rows of data in chunks by specifying ranges of primary keys, is a great idea. But as others here have mentioned, it doesn't always work; requirements vary as to what you want to do with the data. For instance, what if you want it all but want to page through it sorted by some other column value? Selecting just chunks of the data (by ranges of primary keys) may not get you what you want there, as Markus pointed out. Still, the solution is worth pursuing and exploring for many situations.

    If you can put the results of your query into a chunkable, serializable container, this might alleviate a large number of the problems. The size of the result set is not going to be magically compressed; if you HAVE to keep it, it has to go somewhere.

    You have to ask: what are you trying to optimize, and how can you optimize it all? Number of connections opened and query requests made to the database, volume of traffic over the wire (size of responses) from the database, RAM (if needed to hold results), disk cache, etc.
  21. The following is the paging pattern I am currently implementing.

    1. Create an index into my data which allows each property of a domain object
    to be pre-sorted.
    2. Query against the pre-sorted data only. This is efficient because when paging you can use a BETWEEN clause (the ordering values are well formed, i.e. 1, 2, 3, etc.).
    3. Use a batch process to re-order as needed (values can bunch up).

    The idea here is that I want to return only pages, and since the orders are well-defined numbers I do not require a database-specific keyword (ROWNUM, LIMIT, etc.).

    Example:

    1. The user requests the first page from my product table sorted by name ascending. I execute some SQL like:

    select object_id from order_table where class='product' and property='name' and order between 0 and 20

    2. This returns the first twenty primary keys, which I can then look up using an IN clause or one at a time. Either way it will be pretty quick.

    The result is that I perform 2 (or 1 + N) queries, where N is the page/batch size (which is less than 25 for my app). With caching and/or IN clauses we are talking very quick performance. Note that the db connection is now closed and resources are released. I track the current order offset in my code.
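
    A rough sketch of the two queries as they might look in a DAO, assuming the order_table schema above plus a product table keyed by object_id (the post's "order" column is renamed sort_order here because ORDER is a reserved word; ProductDO is an illustrative value object):

    // Query 1 fetches the next page of primary keys from the pre-sorted order table;
    // query 2 fetches the matching products with an IN clause, then everything is closed.
    public List findProductsPage(Connection con, int offset, int pageSize)
        throws SQLException {
      PreparedStatement ps = con.prepareStatement(
          "SELECT object_id FROM order_table " +
          "WHERE class = 'product' AND property = 'name' AND sort_order BETWEEN ? AND ?");
      ps.setInt(1, offset);
      ps.setInt(2, offset + pageSize - 1);
      ResultSet rs = ps.executeQuery();

      StringBuffer in = new StringBuffer();
      while (rs.next()) {
        if (in.length() > 0) { in.append(","); }
        in.append(rs.getLong("object_id"));          // numeric keys, safe to inline
      }
      rs.close();
      ps.close();
      if (in.length() == 0) { return new ArrayList(); }

      Statement stmt = con.createStatement();
      ResultSet products = stmt.executeQuery(
          "SELECT object_id, name FROM product WHERE object_id IN (" + in + ")");
      List page = new ArrayList();
      while (products.next()) {
        page.add(new ProductDO(products.getLong("object_id"), products.getString("name")));
      }
      products.close();
      stmt.close();
      return page;
    }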

    What does everyone think???

    Bruce