It is a common problem:user execute a runtime query and
hundreds(or even more) records get found. User navigate
the result set page by page.
Design 1 - stateful
execute only one query and maintain a long live resultset
(or database cursor).
Design 2 - stateless
donnot maintain a long live cursor, each page generates a
new query follow by seek, read and close.
Which design is better?
Design 1 seems to be faster but I doubt it will scale, e.g.
the jdbc connection cannot be return to the pool because of
the pending resultset.
Design 2 seems to be more scalable and although database
should cache the execution plan and data, but there still
will be a performance hit...
I think design 2 maybe better because design 1 let the end
user control the lifecycle of the resultset, isn't that
break the basic principle?
I am wondering what is the common practice for this..
Anybody here working for Google? or the ServerSide?
-
Scalable design for long run query (7 messages)
- Posted by: qing yan
- Posted on: May 20 2002 12:55 EDT
Threaded Messages (7)
- Scalable design for long run query by Alex Pisarev on May 21 2002 05:36 EDT
- Scalable design for long run query by Greg Nudelman on May 21 2002 13:51 EDT
- Scalable design for long run query by qing yan on May 21 2002 21:29 EDT
- Scalable design for long run query by Alexander Ananiev on May 22 2002 00:04 EDT
-
Scalable design for long run query by qing yan on May 22 2002 03:02 EDT
- Scalable design for long run query by Alexander Ananiev on May 22 2002 11:10 EDT
-
Scalable design for long run query by qing yan on May 22 2002 03:02 EDT
- Scalable design for long run query by Leonard Gurevich on May 24 2002 10:15 EDT
-
Scalable design for long run query[ Go to top ]
- Posted by: Alex Pisarev
- Posted on: May 21 2002 05:36 EDT
- in response to qing yan
I reckon that Google solutions will not be so easy to implement... :)
Anyway, the second option is much better, because sometimes the amount of data requested could be much more than your memory could handle. It's not a good idea to keep connections open for such a long time as well - you will simply get out of your resources. If you caring about performance, I recommend you to use caching of some sort - the table of the most common/latest queries associated with data to be reported back (I assume that Google does the same). The use of caching, however, could be a serious problem if you need to get back only the latest data in real-time - usually it takes a lot of time to invalidate your cache (so it's very rare operation, usually executed on a daily basis).
Alex. -
Scalable design for long run query[ Go to top ]
- Posted by: Greg Nudelman
- Posted on: May 21 2002 13:51 EDT
- in response to qing yan
I agree, option 2 is much better/more scalable. A good RDBMS already takes care of most of the caching, unless your query design is bad -- i.e. you perform multiple DB hits in a loop for example. (Don't laugh. We had a Ph.D. that did exactly that.)
Anyway, don't worry about the bottlenecks this early on. (It's good to think about it though!) Chances are, it will NOT be a DB caching.
Greg -
Scalable design for long run query[ Go to top ]
- Posted by: qing yan
- Posted on: May 21 2002 21:29 EDT
- in response to Greg Nudelman
Agreeded,option 2 and database managed cache would be
a neat combination, but there is a drawback. In option 1,
since you maintain the cursor you can give hints and let
the database prefetch data in the background...the same
thing cannot be achived without complicated caching
logic for option 2. -
Scalable design for long run query[ Go to top ]
- Posted by: Alexander Ananiev
- Posted on: May 22 2002 00:04 EDT
- in response to Greg Nudelman
In the most general case, yes, #2 is more scalable. But one needs to understand performance implications of both options. With #2, every subsequent request would have to retrieve more data from the database, since it is impossible to tell the driver to return only the subset of records (but it is possible to set the maximum). Database cache would not be not of much help here, it is the speed of passing data over the wire (from the database server to the app server) and converting from the result set into objects (very expensive). Also, the query would have to use ORDER BY which is the pretty inefficient operation for most of the databases and it is usually performed in a separate temp space.
On the other hand, modern databases scale pretty well in terms of the number of users. I am mostly familiar with Oracle and it does scale up to thousands in its multi-threaded configuration.
At the same time, you can set low timeout on the stateful bean to minimize the number of occupied connections.
So I can easily come up with the scenario under which #1 is more scalable than #2. Say, depending on your cluster, you may have a lot of users hitting the last page so your app server becomes processor bound trying to handle all these request. Whereas with #1, database server already has all requests pre-cached.
I wonder if anybody implemented #1 in the real life, so far I've only seen people implementing various flavors of #2 in J2EE environment.
Alexander. -
Scalable design for long run query[ Go to top ]
- Posted by: qing yan
- Posted on: May 22 2002 03:02 EDT
- in response to Alexander Ananiev
<q>
since it is impossible to tell the driver to return only the subset of records
</q>
I think you can, at least in theory. JDBC ResultSet have the
setFetchDirection and setFetchSize methods, basically
you can disable prefetching, seeking, and enable prefetching and set fetch size to the page size...of course
how it works in reality depend on how smart the driver or
database engine is.
The drawback of #2 as my understanding is the query setup
cost and data fetching not "pipelined".
<q>
it is the speed of passing data over the wire (from the database server to the app server) and converting from the result set into objects (very expensive).
</q>
Isn't this the same for both #1 and #2. Can you explain more
why you claim the converting from resultset into objects
are "very expensive", just because of object instantiation?
<q>
the query would have to use ORDER BY
</q>
Can you explain more why you HAVE TO use ORDER BY?
<q>
On the other hand, modern databases scale pretty well in terms of the number of users. I am mostly familiar with Oracle and it does scale up to thousands in its multi-threaded configuration.
</q>
Are thousands here mean physical connections or you are
using connection pooling?
<q>
At the same time, you can set low timeout on the stateful bean to minimize the number of occupied connections.
</q>
Same feeling..probably the best approach would be #1 with
timeout combines with #2..
-
Scalable design for long run query[ Go to top ]
- Posted by: Alexander Ananiev
- Posted on: May 22 2002 23:10 EDT
- in response to qing yan
In fact, you're right, with scrollable cursors it should be possible to retrieve just a subset. How it is implemented is a different question. Oracle doc states that it is slower. No wonder, it still has to traverse the index in order to get to the Nth record.
About order by, you don't have to use it but it is desirable since otherwise the order is not guaranteed. By the way, another downside of #2 is that it does not provide read consistency since you have a new transaction every time.
And yes, I witnessed Oracle scaling to a thousand of concurrent users. Bear in mind though, that Oracle uses its own pooling on the top of your app server, so these connections do not necessarily translate into a thousand physical processes.
Alexander
-
Scalable design for long run query[ Go to top ]
- Posted by: Leonard Gurevich
- Posted on: May 24 2002 10:15 EDT
- in response to qing yan
Design 2 is more scalable, but for several hundreds or several thousands of rows (not hundreds of thousands) you can
use design 1, just use disconnected CachedRowSet with JDBC in Stateless EJB. I implemented this using some "smart caching". It's not perfect, but there is always tradeof performance and scalability.