General J2EE: huge data model...... is caching a good strategy?

  1. Hi,

    We have a huge data model - a tree which would exceed 200,000 nodes. The good news is that it is immutable and common to all users of the application.

    I want to know the best strategy to hold this data.
    1. If we hold the entire object in memory, the user will be able to browse the tree pretty fast.
    2. If we cache, say, only the root and one level of children, then every time the user drills down into the tree we would have to fetch details from the database and re-display the tree.

    I am leaning towards the first option, but I don't really know what kind of performance implications this has.
    Any light on this?

  2. You are talking about a really nasty caching problem. If you cache all your data, you will get faster responses, but you will use up memory in your server, which will slow down other functions.

    Most caching mechanisms compromise by holding cached data through soft references, which the garbage collector can reclaim if the server is running low on memory. This gives you the best of both worlds: data stays cached most of the time, but the memory is freed when the server needs it for something else.
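    A minimal sketch of this memory-sensitive idea in plain Java, using soft references so the garbage collector can reclaim entries under memory pressure. The class name and the loader callback are illustrative, not from any product mentioned in this thread:

    ```java
    import java.lang.ref.SoftReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Memory-sensitive cache sketch: values are held via SoftReference,
    // so they stay cached until the JVM is low on memory, at which point
    // the GC may clear them and we simply reload from the data source.
    public class SoftCache<K, V> {
        private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

        public V get(K key, Function<K, V> loader) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get();
            if (value == null) {                       // missing, or reclaimed by GC
                value = loader.apply(key);             // reload, e.g. from the database
                map.put(key, new SoftReference<>(value));
            }
            return value;
        }
    }
    ```

    The first lookup pays the database cost; repeat lookups are served from memory unless the JVM reclaimed the entry in between.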

    Getting this right is a tremendous programming task. Before you implement your own caching system, I strongly recommend that you look at existing tools. Every major EJB server on the market will do data caching for CMP entity beans. Most object-relational mapping tools will also do data caching. Look at OJB and Hibernate (in the open source world) or TopLink (for a commercial tool).
  3. Hey - thanks for the quick reply.
    The thing is, we are not using EJB - just good old servlets and JSP.
    Due to the specialist nature of the application, it would need to support very few users at any point in time (maybe even fewer than 15).
    It was originally a client-server system.
    What would you advise in such a situation?

    p.s. Open source is not an option here, and I really doubt that the company would buy a special product for this caching need.
  4. Well, if you cannot buy a tool or use open source, you will have to build it yourself. Given the low usage requirements of your application, data caching is still probably your best bet.

    If you are going to build your own data caching system, I suggest you use lazy-loading to navigate your tree. That way you cache data "Just In Time" and avoid loading the entire data tree in memory during server startup.

    The easiest thing to do is put lazy-loading and data-caching directly into your object model, in methods like:

    public synchronized NodeSet getChildNodes() {
      if (this.cachedChildNodes == null) {
        // loadChildNodes() stands in for whatever fetches the data from the database
        this.cachedChildNodes = loadChildNodes();
      }
      return this.cachedChildNodes;
    }
    If your object model is already fixed, you will have to be more clever, but the same basic tricks should work.

    By making your methods synchronized, you can safely store the data cache in the application context (i.e. the ServletContext) rather than in the user session, so there is only one copy shared by all of your users.

    Note that all of the above assumes your data is relatively static (changes rarely). If that is the case, you can simply dump the cache from memory on a regular basis (e.g. once per day) to refresh the data. If the data changes more frequently, then data caching may not be an effective approach.
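    As a rough sketch of that periodic dump, here is a hypothetical cache that clears itself once the entries are older than a day and then refills lazily (the class, the id type, and the loader are all illustrative, not the poster's code):

    ```java
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Sketch: keep lazily-loaded nodes in memory, but dump the whole cache
    // once per day so rarely-changing data eventually gets refreshed.
    public class ExpiringNodeCache {
        private static final long MAX_AGE_MS = 24L * 60 * 60 * 1000; // one day
        private final ConcurrentHashMap<String, Object> nodes = new ConcurrentHashMap<>();
        private volatile long loadedAt = System.currentTimeMillis();

        public Object get(String id, Function<String, Object> loader) {
            if (System.currentTimeMillis() - loadedAt > MAX_AGE_MS) {
                nodes.clear();                         // dump the stale cache...
                loadedAt = System.currentTimeMillis(); // ...and start refilling lazily
            }
            return nodes.computeIfAbsent(id, loader);  // load on first access only
        }
    }
    ```

    A background timer that calls `clear()` on a schedule would work just as well; checking the age on access simply avoids the extra thread.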
  5. Thanks to all for your great inputs.
    In light of our situation - no open source, no purchase :( - I think the best option would be lazy loading.

    Thanks again!
  6. Try a distributed cache?

    We had a similar scenario in our application, and we found that by using a localized cache (NearCache) backed by a much larger distributed cache we were able to achieve the performance our users were looking for.

    "A Near Cache is a local cache that sits in front of a distributed cache. In simple terms, it is a cache of a cache. A near cache is a read-through/write-through cache that delegates all operations that it cannot directly handle to the underlying distributed cache. For example, if a cache read occurs and the near cache contains the requested data in its local cache, it returns that locally cached data, otherwise it delegates the cache read to the distributed cache." - http://www.tangosol.com/coherence-featureguide.pdf

    With this approach, the data the user accesses most frequently is always available at in-memory speeds.
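    The near-cache idea from the quote above can be sketched in plain Java as a small local map delegating to a backing store. The backing `Map` here is only a stand-in for the distributed cache; this is not Coherence's actual API:

    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of a near cache: a local cache in front of a larger backing
    // cache ("a cache of a cache"). Reads are served locally when possible
    // and delegated otherwise; writes go through to both levels.
    public class NearCache<K, V> {
        private final Map<K, V> local = new ConcurrentHashMap<>();
        private final Map<K, V> backing; // stand-in for the distributed cache

        public NearCache(Map<K, V> backing) {
            this.backing = backing;
        }

        public V get(K key) {
            V v = local.get(key);
            if (v == null) {                     // local miss: delegate the read
                v = backing.get(key);
                if (v != null) {
                    local.put(key, v);           // read-through: populate locally
                }
            }
            return v;
        }

        public void put(K key, V value) {        // write-through: update both levels
            backing.put(key, value);
            local.put(key, value);
        }
    }
    ```

    A real distributed cache also has to invalidate stale local entries when another node updates the backing store; that part is what makes a product like Coherence worth buying rather than building.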

  7. That is an interesting approach.
    Did your company buy a product that supports near caching, or did your team actually implement this?

  8. > That is an interesting approach.

    > Did your company buy a product that supports near caching, or did your team actually implement this?
    > thanks

    We bought Coherence from tangosol.com. Honestly, it's probably the best third-party product we have ever bought. It has saved us countless hours, and the performance of our application is just amazing. This is definitely an extreme example, but last week we added a Coherence cache to a legacy application and reduced our database hits by over 50%.

    You can get a developer license from them for free to try it out.

  9. Try a distributed cache?.. interesting


       As per your posting, you are planning to hold some 200k nodes? Isn't it something like giving, say, a student table dump as the first level of the tree, and later the user can dig into courses attended, trainings done, exams taken and all that kind of information?

      My personal opinion is that it is a problem in the design if you have to hold 200k nodes :). Sorry to be so blunt, but that's not the way it has to be done. But of course my knowledge of the problem is pretty limited, so let's not go into that.

       What I would like to suggest is to go for lazy caching. The concept is: once the user hits your tree, you expose only the root-level nodes.

       Let the user drill down and reach the final leaf destination. Whatever path the user has followed, keep the loaded nodes in cache. Now if another user (or the same user) goes down a different traversal, you cache only the newly displayed traversal alongside the old cache.

       In general, when a user is working with an application with 200k nodes in the tree, it is highly likely that his day-to-day traversal paths will be limited to 10 to 15 root nodes. Even assuming all 15 users share no traversal paths, that is only 15 x 15 = 225 traversal paths in cache.

       The advantage of this solution is that you load the frequently visited paths into cache, so after the first visit the performance is pretty high. Secondly, as you are not loading the full node list but only the viewed nodes, the memory footprint also stays low.
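    A sketch of this path-at-a-time caching: children are loaded the first time any user drills into a node, then shared by all users from one map. The `PathCache` class and the database-lookup function are hypothetical stand-ins:

    ```java
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Shared cache of tree children, keyed by parent node id. Only the
    // nodes that users actually visit ever enter the cache, so a 200k-node
    // tree costs memory only for the traversed paths.
    public class PathCache {
        private final ConcurrentHashMap<String, List<String>> childrenByParent =
                new ConcurrentHashMap<>();
        private final Function<String, List<String>> db; // hypothetical DB lookup

        public PathCache(Function<String, List<String>> dbLookup) {
            this.db = dbLookup;
        }

        public List<String> childrenOf(String parentId) {
            // First visit loads from the database; every later visit,
            // by any user, is served from memory.
            return childrenByParent.computeIfAbsent(parentId, db);
        }
    }
    ```

    Stored in the ServletContext, as suggested earlier in the thread, one instance serves all 15 users with a single copy of each traversed path.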

    Hope this helps.