Scalable design for long run query

Discussions

EJB design: Scalable design for long run query

  1. Scalable design for long run query (7 messages)

    It is a common problem:user execute a runtime query and
    hundreds(or even more) records get found. User navigate
    the result set page by page.

    Design 1 - stateful

    execute only one query and maintain a long live resultset
    (or database cursor).

    Design 2 - stateless
    donnot maintain a long live cursor, each page generates a
    new query follow by seek, read and close.

    Which design is better?

    Design 1 seems to be faster but I doubt it will scale, e.g.
    the jdbc connection cannot be return to the pool because of
    the pending resultset.

    Design 2 seems to be more scalable and although database
    should cache the execution plan and data, but there still
    will be a performance hit...

    I think design 2 maybe better because design 1 let the end
    user control the lifecycle of the resultset, isn't that
    break the basic principle?

    I am wondering what is the common practice for this..

    Anybody here working for Google? or the ServerSide?

  2. Scalable design for long run query[ Go to top ]

    I reckon that Google solutions will not be so easy to implement... :)

    Anyway, the second option is much better, because sometimes the amount of data requested could be much more than your memory could handle. It's not a good idea to keep connections open for such a long time as well - you will simply get out of your resources. If you caring about performance, I recommend you to use caching of some sort - the table of the most common/latest queries associated with data to be reported back (I assume that Google does the same). The use of caching, however, could be a serious problem if you need to get back only the latest data in real-time - usually it takes a lot of time to invalidate your cache (so it's very rare operation, usually executed on a daily basis).

    Alex.
  3. I agree, option 2 is much better/more scalable. A good RDBMS already takes care of most of the caching, unless your query design is bad -- i.e. you perform multiple DB hits in a loop for example. (Don't laugh. We had a Ph.D. that did exactly that.)

    Anyway, don't worry about the bottlenecks this early on. (It's good to think about it though!) Chances are, it will NOT be a DB caching.

    Greg
  4. Agreeded,option 2 and database managed cache would be
    a neat combination, but there is a drawback. In option 1,
    since you maintain the cursor you can give hints and let
    the database prefetch data in the background...the same
    thing cannot be achived without complicated caching
    logic for option 2.
  5. In the most general case, yes, #2 is more scalable. But one needs to understand performance implications of both options. With #2, every subsequent request would have to retrieve more data from the database, since it is impossible to tell the driver to return only the subset of records (but it is possible to set the maximum). Database cache would not be not of much help here, it is the speed of passing data over the wire (from the database server to the app server) and converting from the result set into objects (very expensive). Also, the query would have to use ORDER BY which is the pretty inefficient operation for most of the databases and it is usually performed in a separate temp space.
    On the other hand, modern databases scale pretty well in terms of the number of users. I am mostly familiar with Oracle and it does scale up to thousands in its multi-threaded configuration.

    At the same time, you can set low timeout on the stateful bean to minimize the number of occupied connections.
    So I can easily come up with the scenario under which #1 is more scalable than #2. Say, depending on your cluster, you may have a lot of users hitting the last page so your app server becomes processor bound trying to handle all these request. Whereas with #1, database server already has all requests pre-cached.

    I wonder if anybody implemented #1 in the real life, so far I've only seen people implementing various flavors of #2 in J2EE environment.

    Alexander.
  6. <q>
    since it is impossible to tell the driver to return only the subset of records
    </q>
    I think you can, at least in theory. JDBC ResultSet have the
    setFetchDirection and setFetchSize methods, basically
    you can disable prefetching, seeking, and enable prefetching and set fetch size to the page size...of course
    how it works in reality depend on how smart the driver or
    database engine is.

    The drawback of #2 as my understanding is the query setup
    cost and data fetching not "pipelined".

    <q>
    it is the speed of passing data over the wire (from the database server to the app server) and converting from the result set into objects (very expensive).
    </q>
    Isn't this the same for both #1 and #2. Can you explain more
    why you claim the converting from resultset into objects
    are "very expensive", just because of object instantiation?

    <q>
    the query would have to use ORDER BY
    </q>
    Can you explain more why you HAVE TO use ORDER BY?

    <q>
    On the other hand, modern databases scale pretty well in terms of the number of users. I am mostly familiar with Oracle and it does scale up to thousands in its multi-threaded configuration.
    </q>
    Are thousands here mean physical connections or you are
    using connection pooling?

    <q>
    At the same time, you can set low timeout on the stateful bean to minimize the number of occupied connections.
    </q>
    Same feeling..probably the best approach would be #1 with
    timeout combines with #2..





  7. In fact, you're right, with scrollable cursors it should be possible to retrieve just a subset. How it is implemented is a different question. Oracle doc states that it is slower. No wonder, it still has to traverse the index in order to get to the Nth record.
    About order by, you don't have to use it but it is desirable since otherwise the order is not guaranteed. By the way, another downside of #2 is that it does not provide read consistency since you have a new transaction every time.
    And yes, I witnessed Oracle scaling to a thousand of concurrent users. Bear in mind though, that Oracle uses its own pooling on the top of your app server, so these connections do not necessarily translate into a thousand physical processes.

    Alexander
  8. Design 2 is more scalable, but for several hundreds or several thousands of rows (not hundreds of thousands) you can
    use design 1, just use disconnected CachedRowSet with JDBC in Stateless EJB. I implemented this using some "smart caching". It's not perfect, but there is always tradeof performance and scalability.