Discussions

Performance and scalability: Performance Considerations

  1. Performance Considerations (8 messages)

    Our application needs to access a central repository of Customer Information. The Repository team gives us an EJB interface that we call regularly during the execution of our business logic.

    But a new requirement demands certain information from the Customer repository, and the volume is very large: (10000*15) records per query, 80000 queries per day.

    As far as we know, the repository resides in an Oracle DB.

    I have the following concerns regarding this new module:

    1) Should we propose a solution that goes against the existing design (i.e. the EJB call), basically a DAO that resides locally in our JVM?

    2) Should we use an intermediate object for a clean interface, or use a ResultSet or RowSet to pass the data across more efficiently, thereby removing the object-creation overhead? (Irrespective of whether the DAO lives in the client JVM or the DAO is a Session Bean in their server.)

    3) Should I cache some data? The query results are not the same even for the same customer, so if I cache some of the results I need to go through the entire list to filter out the applicable ones.

    4) One day of latency is acceptable for any update in this portion of the Customer data. Is taking a daily feed of all the data at some point in the day worth thinking about? It would require us to replicate the schema on our side.

    5) Directly accessing their DB (through a local DAO) makes their DB transparent to us. Is that OK?
    Being a central repository of Customer Information, it is rather like a data warehouse. What kind of interfaces do data warehouse applications give to their clients?


  2. Performance Considerations

    Hi S,
    3) Should I cache some data? The query results are not the same even for the same customer, so if I cache some of the results I need to go through the entire list to filter out the applicable ones.

    You can cache the entire Customer dataset using Coherence's Partitioned Cache, which partitions the cached dataset equally across all nodes participating in the cluster. For scalable queries you can use our Distributed Query technology, which allows programmatic querying of the properties of the objects in the cache. This is just as scalable as Coherence itself, since the data, and hence the "query", are partitioned across the cluster.
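    A minimal sketch of what this looks like in code, assuming the Coherence filter API (com.tangosol.util.filter) and a hypothetical "dist-patterns" cache holding pattern objects that expose a getCustomerId() accessor:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.filter.EqualsFilter;

    public class PatternQueryExample {
        public static void main(String[] args) {
            // Obtain the partitioned cache; "dist-patterns" is a made-up name that
            // would map to a distributed (partitioned) scheme in the cache config.
            NamedCache cache = CacheFactory.getCache("dist-patterns");

            // The filter is evaluated on the nodes that own the data, so the full
            // dataset never has to travel back to this JVM.
            Set entries = cache.entrySet(new EqualsFilter("getCustomerId", "C-1001"));

            for (Iterator it = entries.iterator(); it.hasNext(); ) {
                Map.Entry entry = (Map.Entry) it.next();
                System.out.println(entry.getValue()); // a cached pattern object
            }
        }
    }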

    Later,
    Rob Misek
    Tangosol, Inc.
    Coherence: It just works.
  3. 2+3+4

    My 2 cents:

    1. A DAO may be better than the EJB call, but it won't improve performance much; there is still object overhead.

    2. Looks good, but it needs to be applied in combination with 3 and 4 below.

    3. Since one day of latency is acceptable, cache the one-day-old data to query against. It should be in your local DB.

    4. Best bet for performance: since your requirements are read-only, you can replicate just views of the required tables.

    5. Not much of an idea on this one.


    Combining all these:
    Locally store (cache) the one-day-old data, loaded overnight, so there is no performance implication during the day.
    Use a result set/row set for passing data efficiently (a rough sketch follows below).
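    One way to sketch the "result set/row set" idea against the local replica: a plain JDBC DAO that returns a disconnected CachedRowSet, so rows can be handed across layers without building a value object per row. The table, column, and connection details are invented, and CachedRowSetImpl is Sun's reference implementation rather than a portable API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.RowSet;

    import com.sun.rowset.CachedRowSetImpl; // Sun's RowSet reference implementation

    public class CustomerPatternDao {

        // Returns a disconnected row set: the connection is closed before returning,
        // so callers can iterate over the rows without holding a DB resource.
        public RowSet findPatterns(String customerId) throws Exception {
            Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:LOCALDB", "app", "secret");
            try {
                PreparedStatement ps = con.prepareStatement(
                    "SELECT pattern_id, pattern_text FROM customer_pattern WHERE customer_id = ?");
                ps.setString(1, customerId);
                ResultSet rs = ps.executeQuery();

                CachedRowSetImpl crs = new CachedRowSetImpl();
                crs.populate(rs); // copy the rows into memory, then release the statement
                rs.close();
                ps.close();
                return crs;
            } finally {
                con.close();
            }
        }
    }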


    Hope that helps!
    Nitesh
  4. Performance Considerations

    But a new requirement demands certain information from the Customer repository, and the volume is very large: (10000*15) records per query, 80000 queries per day.

    How many columns are there per row? Are you completely sure that every time you access the Customer Information you really need to read 150k rows?

    If, for example, you have only 10 fields per row, you are going to read 1.5 million column values every time. If every column is, say, an alphanumeric field of size 10, this implies that every time you access the Customer Information you are requesting about 15 megabytes of information from the database...

    Of course, this information must be sent across the network from the database to the application server, and then these 15 megabytes of information must be converted into Java objects (and of course, these objects will then have to be garbage collected).

    My advice is to reconsider this design: if you are going to perform 80000 queries every day, and every query is going to move 15 megabytes of information, then probably (IMHO) there is something wrong in the design of your application.

    Why don't you filter the information in Oracle (for example, using PL/SQL) and then send only the result to the application server?
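    As a sketch of the difference (table and column names are invented): instead of pulling every row and filtering in Java, push the condition into the SQL so only the matching rows cross the network.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class FilteredQueryExample {

        // Let Oracle do the filtering; only the rows this transaction needs come back.
        public int countMatches(Connection con, String customerId, String messageType)
                throws Exception {
            PreparedStatement ps = con.prepareStatement(
                "SELECT COUNT(*) FROM customer_pattern " +
                "WHERE customer_id = ? AND message_type = ?");
            ps.setString(1, customerId);
            ps.setString(2, messageType);
            ResultSet rs = ps.executeQuery();
            rs.next();
            int matches = rs.getInt(1);
            rs.close();
            ps.close();
            return matches;
        }
    }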


    Jose Ramon Huerga
    http://www.terra.es/personal/jrhuerga
  5. Performance Consideration

    Rob,
    Querying the objects in the cache sounds good.
    I need to look into efficient in-memory data management.
    I will have to do some study on this.

    Jose,
    Ours is a message-oriented transaction processing system.
    Each customer has a few patterns associated with them. We need to check the incoming message for any patterns associated with the customer involved in that transaction.
    Certain clients who do a lot of business with us have a huge number of associated patterns; on average, 500 to 1000 patterns per customer.
    We handle 80000 transactions per day. We cannot move this pattern searching into the database; we need to perform it in a separate process (JVM/app server).
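    For illustration only, assuming (which the post does not say) that a "pattern" can be expressed as a java.util.regex regular expression, the per-message check might look like this:

    import java.util.Iterator;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MessagePatternChecker {

        // Returns true if any of the customer's precompiled patterns occurs in the
        // incoming message. The real pattern language may be something else entirely.
        public boolean matchesAny(String message, List customerPatterns) {
            for (Iterator it = customerPatterns.iterator(); it.hasNext(); ) {
                Pattern p = (Pattern) it.next();
                Matcher m = p.matcher(message);
                if (m.find()) {
                    return true;
                }
            }
            return false;
        }
    }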

    I infer the following:

    1) Have a local replica of the data, refreshed every night.
    2) Use a DAO returning a result set.
    3) No need for object encapsulation; iterate over a result set.
    4) Have a least-frequently-used in-memory cache holding all the info for a customer (a rough sketch follows after this list).
    5) Adopt a strategy to query the in-memory objects...
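    For point 4, a bare-bones least-frequently-used cache, only to illustrate the eviction idea; it is not thread-safe, the bookkeeping is naive (O(n) eviction), and the class is made up rather than taken from any framework:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Minimal LFU cache: evicts the entry with the lowest access count when full.
    public class LfuCache {
        private final int capacity;
        private final Map data = new HashMap();   // key -> value
        private final Map counts = new HashMap(); // key -> Integer access count

        public LfuCache(int capacity) {
            this.capacity = capacity;
        }

        public Object get(Object key) {
            if (!data.containsKey(key)) {
                return null;
            }
            counts.put(key, new Integer(((Integer) counts.get(key)).intValue() + 1));
            return data.get(key);
        }

        public void put(Object key, Object value) {
            if (!data.containsKey(key) && data.size() >= capacity) {
                evictLeastFrequentlyUsed();
            }
            data.put(key, value);
            counts.put(key, new Integer(1));
        }

        private void evictLeastFrequentlyUsed() {
            Object victim = null;
            int min = Integer.MAX_VALUE;
            for (Iterator it = counts.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                int count = ((Integer) e.getValue()).intValue();
                if (count < min) {
                    min = count;
                    victim = e.getKey();
                }
            }
            data.remove(victim);
            counts.remove(victim);
        }
    }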

    Should you have any further thoughts, do share.

    Thanks for all your replies.
    ST
  6. Take a look at Sun's ValueListHandler

    Take a look at Sun's ValueListHandler core J2EE pattern. The forces in your application seem to be a good fit: too many records to materialize in the client or to move over the wire at once, the need to cache records in the server tier, and the need to cleanly separate concerns.

    The example implementation that Sun provides in its Core J2EE Patterns book (2nd ed., Alur et al.) uses POJOs, but they say that you can implement the main value list handler as a stateful session EJB.

    Also, if you couple this with some efficient SQL in your DAO (such as is described here) you should be in good shape.
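    A stripped-down sketch of the Value List Handler idea (class and method names are mine, not the book's code): the handler runs the search once in the server tier, caches the full result list, and hands the client one page at a time. Wrapped in a stateful session bean it would keep the read position between calls.

    import java.util.ArrayList;
    import java.util.List;

    // Server-side handler that caches a query result and serves it page by page,
    // so the client never materializes all 150,000 records at once.
    public class PatternValueListHandler {

        private List results = new ArrayList(); // cached value objects
        private int index = 0;                  // current read position

        // In the full pattern this would delegate to a DAO; here we just accept the list.
        public void executeSearch(List queryResults) {
            this.results = queryResults;
            this.index = 0;
        }

        public int size() {
            return results.size();
        }

        // Return the next pageSize elements of the cached result list.
        public List getNextPage(int pageSize) {
            int end = Math.min(index + pageSize, results.size());
            List page = new ArrayList(results.subList(index, end));
            index = end;
            return page;
        }
    }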

    Cheers, Michael
  7. Very Interesting Scenario

    (10000*15) records per query, 80000 queries per day.

    Wow! Now that's some traffic.

    Considering the amount of data being pulled, this does not sound like a real-time processing system. That kind of data volume is just bizarre. If you have a system waiting for such amounts of data via RMI/IIOP or XML/HTTP, it would be really challenging.

    Let us keep design patterns such as DAO and Value List Handler aside for a moment. Even though you can cache with frameworks, does that really solve the problem in a clean and elegant way? Now we are talking about replicating not a cache but state. Do you need failover too? I can understand caching with fancy frameworks for small amounts of data, but caching such huge sets of data in a clustered environment makes me nervous. There must be an easier way.

    How about this:

    1. The consumer fires a request and does not wait for a response.
    2. The request goes into a message queue.
    3. The data is queried and the results are saved, compressed, as a record in the DB, in the format your consumer needs (most likely XML).
    4. The consumer is notified of the existence of his/her query results.
    5. The consumer accesses a separate application server/middleware component, so that one application can concentrate on accepting requests and the other on the download of humongous amounts of data.

    This would mean that the consumers listen for messages.
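    A rough sketch of steps 1, 2 and 4 with plain JMS (the JNDI names and the notification payload are invented): the consumer fires the request into a queue and later receives only a small notification telling it where the prepared, compressed result can be picked up.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.naming.InitialContext;

    public class AsyncQueryClient {

        public void run() throws Exception {
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
            Queue requestQueue = (Queue) ctx.lookup("jms/PatternRequestQueue");
            Queue replyQueue = (Queue) ctx.lookup("jms/PatternReplyQueue");

            Connection con = cf.createConnection();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            con.start();

            // Steps 1 and 2: fire the request into a queue and carry on; no blocking call.
            MessageProducer producer = session.createProducer(requestQueue);
            producer.send(session.createTextMessage("customerId=C-1001"));

            // Step 4: later, a small notification says where the prepared (compressed)
            // result set is stored; the bulk data never rides on this message.
            MessageConsumer consumer = session.createConsumer(replyQueue);
            Message notification = consumer.receive();
            System.out.println("Result ready: " + ((TextMessage) notification).getText());

            con.close();
        }
    }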

    The advantage here is that we cache the formatted and compressed responses, not the stale data that we would otherwise have to query in real time. We do this at our own sweet time, and since they can wait 24 hours anyway, this gives you ample time to run the query.

    I would suggest staying away from fancy XML parsing as much as possible.
  8. Very Interesting Scenario

    I can understand caching using fancy frameworks for small amounts of data but caching such huge sets of data in a clustered environment makes me nervous.

    Yes, I agree with that. Although you can probably find commercial cache systems that handle huge amounts of data, my personal perception is that they are designed to work with small amounts of data.

    I would suggest staying away from fancy XML parsing as much as possible.

    You should only parse small XML documents (less than 10 KB). If you parse big XML documents (more than 100 KB) all the time, you are going to have performance headaches...



    Jose Ramon Huerga
    http://www.terra.es/personal/jrhuerga
  9. Very Interesting Scenario

    Hi Jose,
    I can understand caching using fancy frameworks for small amounts of data but caching such huge sets of data in a clustered environment makes me nervous.
    Yes, I agree with that. Although you can probably find commercial cache systems that handle huge amounts of data, my personal perception is that they are designed to work with small amounts of data.

    In the past this may have been true, but Coherence is specifically designed to cache (or manage) _enormous_ amounts of data in a clustered environment. Some customers are testing caching terabytes of data (can't wait to be on-site to see that datacenter ;)). Coherence is able to provide this type of functionality by partitioning the entire cached data set equally across the participating cluster nodes (automatically _and_ transparently), complete with fault tolerance based on a configurable number of backups. By partitioning the data evenly, the per-port throughput (the amount of work being performed by each server) remains constant, hence linear scalability.

    Later,
    Rob Misek
    Tangosol, Inc.
    Coherence: It just works.