Need some advice regarding, loading huge set of data onto memory


Performance and scalability: Need some advice regarding, loading huge set of data onto memory

  1. Hi,

    I need some suggestions regarding loading huge set of data on to Application server side memory.

    In my application, I am making call to some reference table in a loop, because of this, we are making many i/o operations to the database.

    To avoid many i/o operations to the db, can we get all the reference table data(Approx. 5 million records, each record contains 35 cols, the total memory size of 5 million records would be 100mb) into apps server memory interms of vectors?? the vectors are in a class file and that class file will be instantiated in the init() of a servlet. Is it good idea to keep all the data onto apps server memory. Right now, we have 512mb memory available in the heap.

    The project is in beginning phase. Once if it is in full production, evantually users will increase in 1000s, Approximately, i would say
    10,000 users will be there for now.

    I would like to get some suggestions from u, regarding this.

    my email id is c_golapar at verizonwireless dot com or arunagk@hotmail.som

    Thanks in Advance.

    Aruna G
  2. This is usually not the best approach. It really doesn't scale very well. On the other hand, if you know your data size, if you know your user size, and you know your machines can handle it, fine.

    Generally speaking, databases are better at managing and retrieving large sets of data than servers are. That's what they're for! My inclination would be to look at optimizing your DB for your read-only tasks through indices, denormalized tables, etc., and look at getting native drivers and such. You should be able to get sufficient performance that way, and it'll scale better.

    On the other hand, if you can afford the memory, it may be easier and faster to cache it all in memory. You might also look at app servers that cache for you.

    Ben Flaumenhaft
    Principal, Sidelight Consulting
  3. You got a very very bad idea.

    Not to mention that you may want to redo your calculation. Unless you store less than 1 byte per column (maybe you store two columns encoded in one byte) I don't see how you got the 100 MB figure.

    You have to come back with more details (what do you mean by loop, looping through 5 milion of records ? how often ? in what circumstances ?).

    And unless you want to bump your head in the wall, you may want to rethink your application.
  4. Hi,
    i also have very bad sense from this yours way to solve problem with huge set of data in table.

    This solution will ( with good machine ) work, but if you'll have requests to operate on whole set of data , like re-ordering, searching in multiple columns etc.., i am affraid you'll be not so efficient like some good configured DB.

    Some solution is that you'll have to chache not all data on init. But you'll keep cache for most often used data ( or whole requests ). We solve serveral problems in this way, and we got good results.

  5. What you need to look at are Object Caches / O-R mapping tools like Javalin or Toplink or Cocobase. I havent done any evaluation of any of these to be able to give you any valuable advice - no doubt, others can.
  6. Hi,
        1. Let us not go by assumptions for they may put the whole project under risk.
        2. Please do some more initial study based on the performance requirments..
        3. Please get these details from the client - scalability expected, latency tolerated, response time, connection speeds, database, and the application server you are planning to use for production, not for development!!!!.
        4. I agree completely with the opinions of others who replied earlier to me. Design is always a compromise, but not at the cost of the project. Please try to use caching with great caution.
         5. Another idea that came up to my mind is whether you can use an intelligent caching, where you keep frequently used records in the middleware cache and if the hit fails in the cache use the datsbase conn and get from there. The logic for this cache could be improved as you go on and you could achieve a decent hit-miss ratio by the time you go for production. The cache should be dynamically updated. Is this approach suits your requirement?
    Experts please comment on this and correct/enhance with better ideas.

    Vamsi Kalyan Poondla
  7. Hi,

    Thanks for all your suggestions. Here once again, I will try to explain, what i wanted and what
    i am doing.

    For an Analysis, I have to refer 4 tables. Those tables are referred in a loop. The size of the loop
    varies depends certain scenario. Assume, the loop size is 200.
    For Loop -- 200 times.

    get values from the 4 tables for a particular value.
    do the business logic.
    store the final value in an array.

    loop ends

    Now, When I am doing like this, the number of i/o operations made to the db is 800. becasue of this,
    it is taking time to get the analysis done.

    For an analysis, it is taking 7 to 10 seconds. If it is multiple analysis, then it is taking
    around a minute of time.

    The whole set of records for all the 4 tables is 5 millions.

    Here is the question,

    I have to bring the results within 5 seconds for multi line analysis.

    Once again thanks for all your thoughts.

    Aruna G
  8. Ok,

    So if I understanda correctly you have to access roughly 200 loops to read values from 4 tables, calculate a value based on business logic and update or insert in some other table . (what do you do with the array ??)

    Anyway, to load the whole set in memory is not quite good enough.
    It may be a valid approach in special circumstances, but it requires a lot of time and effort put into implementing
    advanced algorithms.

    Whereas a database has these algorithms for free.
    So my suggestion is to put your business logic in stored procedures (so you don't have those 800 db calls through the JDBC driver), then spend time tuning database (in terms of cache size and table indices).

    I assure you a tuned database WILL NOT make 800 I/O ops for 800 rows processed, and it can handle these kind of processing quite easily.

  9. I agree with Costin that you have to first look at using a Stored Procedure. If, for any reason, you still have a problem with performance, then you probably have to go with the "load in memory" approach. Good luck. - Srini...
  10. Putting your business logic into a stored procedure is a bit of a DBA's approach to system design - I would avoid using stored procedures unless it is desperately necessary and unless it is very simple logic. (Equally this viewpoint could be said to be characteristic response of a non-DBA ;-) )

    I would whole-heartedly agree with you that a stored proc (in this case) might be the cheap and cheerful fix.

    However, it would depend on how static or dynamic this large chunk of data is (how often does it change).

    Where a stored proc wont work (and where an obect or data cache will) is when you want to do a bit of fiddling with the data. You dont want to change this data permanently (commit it to database) but you want to "play" with it in RAM. A common approach to real-time risk calculation for trading apps provides a service "what would my exposure be if I did this...".
    Here a stored proc is not the option.

    I have seen more scenarios than I would like where tons of business is logic trapped in a legacy database. It is neither very portable or very understandable at all. And when you want to change database vendor.....

    My advice is:
    1) If it can be easily stored-proc'ed AND you are comfortable with that - then do it.

    2) Most appservers have some sort of start-up class feature (ie the ability to load some other class but a J2ee one - if your data is purely static, then do this.)

    3) Explore the use of Object Caching tools such as Toplink, Javalin or Cocobase (of these, the only opinion I have been given is that Cocobase is pants for performance)

    4) Investigate the use of In memory databases such as TimesTen.