Discussions

News: Give your DB a Break: Using Caching for Speed and Availability

  1. Learn how caching data in front of the database can allow for faster, more available applications. In this article, Dion Almaer looks at clustering and caching strategies, using a distributed cache, read-through/write-behind caching, and technologies that integrate nicely into a distributed caching architecture, such as JDO, JMS, and JNDI.

    Read Give your DB a Break

    Threaded Messages (88)
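
    The read-through/write-behind idea the article describes, in miniature - a hand-rolled sketch, not any particular product's API; the Dao interface and its methods are invented for the example:

        import java.util.HashMap;
        import java.util.Map;

        // A hand-rolled read-through/write-behind cache sketch. Real products
        // add eviction, batching of writes, and cluster-wide invalidation.
        public class ReadThroughCache {
            public interface Dao {
                Object load(Object key);
                void store(Object key, Object value);
            }

            private final Map cache = new HashMap();  // local store
            private final Dao dao;                    // the "real" data source

            public ReadThroughCache(Dao dao) {
                this.dao = dao;
            }

            // Read-through: on a miss, load from the database and remember it.
            public synchronized Object get(Object key) {
                Object value = cache.get(key);
                if (value == null) {
                    value = dao.load(key);
                    cache.put(key, value);
                }
                return value;
            }

            // Write-behind: update the cache immediately, push the change to
            // the database asynchronously (naive version; real caches queue
            // and batch the pending writes).
            public synchronized void put(final Object key, final Object value) {
                cache.put(key, value);
                new Thread(new Runnable() {
                    public void run() {
                        dao.store(key, value);
                    }
                }).start();
            }
        }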

  2. Very Timely Article...

    Very nice. The thread below about Rod's interview was dripping with sarcasm and disbelief that an entity bean strategy is useless.

    Thanks for the read, Dion.

    Best,

    John C. Dale
  3. Correction...

    'is useless' should be 'is viable'.
  4. Very Timely Article...

    Not to bug you, but a distributed cache with auto flushing on update is transparent and easy and built into iBatis, and you do not need to change the DAO or do any of this work.
    .V

    > Very nice. The thread below about Rod's interview was dripping with sarcasm and disbelief that an entity bean strategy is useless.
    >
    > Thanks for the read, Dion.
    >
    > Best,
    >
    > John C. Dale
  5. Very Timely Article...

    Very nice. The thread below about Rod's interview was dripping with sarcasm and disbelief that an entity bean strategy is useless.

    >

    It must be possible to implement a forum without clever EJB caches;
    KISS is a good approach for scalable applications too.
    This is a very nice forum: http://forum.hibernate.org/index.php. Trust me, it is implemented without entity beans :)

    > Thanks for the read, Dion.
    >
    > Best,
    >
    > John C. Dale
  6. Very Timely Article...

    Nonetheless... we are talking about the story that was presented - this site is implemented using a cache and ENTITY BEANS.

    I know it's possible to do it both ways. I'm also glad it is a free country.

    Best,

    John C. Dale
  7. Hibernate caching

    Hi Juozas,

    As an aside, Gavin did publish a Tangosol Coherence plug-in for Hibernate.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Shared Memories for J2EE Clusters
  8. Hibernate caching

    It was about PHP (the Hibernate forum implementation), not about Hibernate caching. It does not have TSS.com's scalability problems :)
  9. This article mentions JSR 107 and Tangosol Coherence.

    I also know of Apache Turbine's JCS.

    What else is available? What are people's experiences?

    thanks,
    james
  10. I have written plugins for using both Tangosol Coherence and OpenSymphony's OSCache in distributed environments for Jakarta's Object/Relational Bridge project. Both worked very well. Coherence has a lot more features, but also costs money. OSCache is functional and easy to use (uses JavaGroups for its comms).

    I had a lot of problems using JCS in that environment.

    Jason McKerr
    Northwest Alliance for Computational Science and Engineering
  11. Slightly Biased But...

    I have to admit a bias as I used to work for them, but Isocra's livestore has to be worth a look. The transparency of the thing is a big, big win for me.

    No doubt someone who works there now will be along in a second to sing its praises.
  12. Gigaspaces?

    GigaSpaces (gigaspaces.com) also seems to be a decent product for achieving distributed caching. Does anyone have any experience with this product?
  13. Gigaspaces?

    Just to update you all:
    GigaSpaces provides a cache implementation for Hibernate.
    It is implemented on top of our distributed caching facility.
    We have several customers using this very successfully.

    See full info:
    http://www.hibernate.org/201.html

    Best Regards,
            Shay
    ----------------------------------------------------
    Shay Hassidim
    Product Manager, GigaSpaces Technologies
    Email: shay at gigaspaces dot com
    Website: www.gigaspaces.com
  14. JSR 107 Alternative - SpiritCache

    Have a look at SpiritCache from SpiritSoft - it complies with JSR 107. It uses JMS for all movement of data from/to data sources and between caches. Caches load data either from the data source or from other caches. Using JMS for transport means it is reliable and scalable, and, uniquely, caches can "subscribe" to data they have an interest in. It supports both cache reads and synchronized cache write-through. E-Trade is a big user.
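
    The "subscribe" idea in plain JMS terms might look like the following - a hand sketch of the concept only, not SpiritCache's actual API; the JNDI name and topic name are invented:

        import java.util.Map;
        import javax.jms.Message;
        import javax.jms.MessageListener;
        import javax.jms.Session;
        import javax.jms.Topic;
        import javax.jms.TopicConnection;
        import javax.jms.TopicConnectionFactory;
        import javax.jms.TopicSession;
        import javax.naming.InitialContext;

        public class CacheSubscriber {
            // Subscribe a local cache to the update topic for the data it holds.
            public void subscribe(final Map localCache) throws Exception {
                InitialContext jndi = new InitialContext();
                TopicConnectionFactory tcf = (TopicConnectionFactory)
                    jndi.lookup("jms/TopicConnectionFactory");    // invented JNDI name
                TopicConnection con = tcf.createTopicConnection();
                TopicSession session =
                    con.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
                Topic topic = session.createTopic("cache.updates.accounts"); // invented
                session.createSubscriber(topic).setMessageListener(new MessageListener() {
                    public void onMessage(Message msg) {
                        // Naive policy: any update invalidates the whole region.
                        // A real product applies the specific change carried by msg.
                        localCache.clear();
                    }
                });
                con.start();  // begin delivery
            }
        }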

    Regards

    Andrew
  15. Object Caching using JCS

    I am using the Java Caching System (JCS) for object caching in a web portal application. The framework I have created is pretty basic in nature. It is used to cache the data retrieved from an EJB application server, and it stores the objects in the Tomcat servlet container. I think JCS is a good caching framework for object caching needs. I haven't looked into OpenSymphony's OSCache. Is OSCache also based on JSR 107?
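
    For anyone curious, basic JCS usage is roughly the following (from memory of the Jakarta JCS API, so double-check the package names; the "portalData" region and the EJB-tier loader are invented):

        import org.apache.jcs.JCS;
        import org.apache.jcs.access.exception.CacheException;

        public class PortalCacheExample {
            public Object getUser(String key) throws CacheException {
                JCS cache = JCS.getInstance("portalData"); // region from cache.ccf
                Object user = cache.get(key);              // null on a miss
                if (user == null) {
                    user = loadFromEjbTier(key);           // invented loader
                    cache.put(key, user);
                }
                return user;
            }

            private Object loadFromEjbTier(String key) {
                return new Object(); // stand-in for the real EJB call
            }
        }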
  16. JCS LTCP works?

    I get a socket write error when JCS attempts to communicate with the other nodes' local caches. Did this happen to you?
  17. JCS LTCP works?

    I get a socket write error when JCS attempts to communicate with the other nodes' local caches.
    We were never able to get JCS working in a cluster.

    If you need clustered caching, check out our Coherence software. If you are looking for an open source framework (a la JCS) then check out OSCache at OpenSymphony.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  18. Hi Nitin,

    Thanks for the article. Here is a good paper on using a distributed cache, by Kyle Brown.

    http://www.eaipatterns.com/docs/distributedcacheupdate.pdf
  19. Cache Service

    Hi all,
    Should I bind the JCS caching service as an Avalon service or a Struts plugin, or is it more convenient to bind the service as an MBean in JBoss MX so I can use it anywhere and not only inside the web layer?
    Thank you

    Faisal
    UK,Bham
  20. Error in the article?

    The article says:
    The db-is-shared magical flag has to be turned off if you fall under any of the conditions below:

    You probably mean "on", right?

    --
    Cedric
  21. db-is-shared

    Hi Cedric -

    The db-is-shared tag probably shouldn't be set to "true" unless you are running one instance, and you really know what you are doing.

    So the following statement is basically saying: if you fall under those conditions, then you can't keep db-is-shared as true anymore. This happened to us as we moved from a single instance of WLS to a cluster of application servers.

    "The db-is-shared magical flag has to be turned off if you fall under any of the conditions below:

        * You want to run more than one instance of the application
              o E.g. using WebLogic clustering
        * Other programs access the database behind the back of the application
              o Outside processes
              o DBAs sneaking around changing things"

    D
  22. db-is-shared

    From the edocs

    http://edocs.bea.com/wls/docs60/ejb/EJB_environment.html#1047000

    "Restrictions and Warnings for db-is-shared

    Setting db-is-shared to "false" overrides WebLogic Server's default ejbLoad() behavior, regardless of whether the EJB's underlying data is updated by one WebLogic Server instance or multiple clients. If you incorrectly set db-is-shared to "false" and multiple clients (database clients, other WebLogic Server instances, and so forth) update the bean data, you run the risk of losing data integrity.

    Also, due to caching limitations, you cannot set db-is-shared to "false" in a WebLogic Server cluster.
    "

    It needs to be "true" in a cluster by definition. I've had the opposite experience with BMP entities synching in clusters. (We tried "false" in a cluster and had all sorts of integrity problems pop up under regression testing.)

    - F
  23. db-is-shared

    Dion:
    > Hi Cedric -
    >
    > The db-is-shared tag probably shouldn't be set to "true" unless you are running one instance, and you really know what you are doing.

    Uh, double negatives. Headache.

    Actually, it's the opposite. Your sentence should read:

    "should be set to true unless WLS owns the database"

    Clearer?


    > "The db-is-shared magical flag has to be turned off if you fall under any of the conditions below:
    >
    > * You want to run more than one instance of the application
    > o E.g. using WebLogic clustering
    > * Other programs access the database behind the back of the application
    > o Outside processes
    > o DBAs sneaking around changing things"

    Sorry, still think you have it backwards. The flag should be set to "true" in all these cases.

    By the way, it's been renamed "cache-between-transactions" in the more recent releases (more explicit this way).

    --
    Cedric
  24. Seems to be a directionless article

    The article seems to advocate distributed caching without taking into consideration the performance cost of different approaches.

    1. The author says that he doesn't recommend the default implicit caching by the container, as the container is confused by background updates. But he has not given any resource on how distributed caching engines will get to know of parallel background updates.

    So as it stands, neither the container nor the third-party solution will get to know of background updates made directly in the DB (DBAs moving around).

    The solution in the EJB specifications is to recommend modelling as entity beans only those tables which are not to be accessed directly by other sources, as this has a direct bearing on distributed caching by the container. So a third-party cache doesn't seem to help anyway here. (Use SB+DAO for a fast lane reader there.)

    2. Any generic caching product requires extensive synchronization in order to preserve data consistency.

    A container is able to synchronize to the extent of an individual row of the DB (represented by one instance of the bean). (VERY EFFICIENT IN ALL LOAD SCENARIOS)

    Whereas any external solution will result in all bean instances for that table / application being queued up in synchronized blocks of the caching API. This may not get noticed under low load but will have a drastic impact due to increased resource contention as the transaction load increases. (TRAGIC IN HIGH LOAD SCENARIOS)

    My 2 cents
    Regards
    Aman Aggarwal
  25. good points

    Aman -

    You have brought up many good points.

    This was meant to be a broad article discussing some scalability issues and how distributed caches can help the situation. The examples that were used are from particular use cases, but there was some discussion as to what scalability means in different scenarios.

    When designing your application you have to carefully consider your particular requirements... [see the Scaling Application Performance and Availability areas], and based on these you can design your data architecture.

    There is definitely no silver bullet. You shouldn't always use a distributed cache; however, they can be very useful in many situations, which is what the article has discussed.

    Also, implicit caching by the container isn't necessarily a bad thing at all. In some situations it works great (it did for us at TheServerSide before we went to a clustered architecture).

    The background update issue is a separate issue, and one that definitely needs to be addressed in many situations. The distributed cache doesn't solve this issue. Since in many of our designs the cache is king, we don't have to worry about it much, as everyone goes through the cache. Most of the good caching solutions have various technologies to help with cache invalidation (Seppuku, etc.)... but this is most definitely something that needs to be thought of and addressed.

    Dion
  26. Seems to be a real-world article

    Aman: Any generic caching product requires extensive synchronization in order to preserve data consistency. A container is able to synchronize to the extent of an individual row of the DB (represented by one instance of the bean). (VERY EFFICIENT IN ALL LOAD SCENARIOS) Whereas any external solution will result in all bean instances for that table / application being queued up in synchronized blocks of the caching API. This may not get noticed under low load but will have a drastic impact due to increased resource contention as the transaction load increases. (TRAGIC IN HIGH LOAD SCENARIOS)

    I can't speak for any other solutions in the market, but the distributed synchronization issue that you highlighted is largely solved in high-scale environments by using a partitioned cache (with additional options for version-based concurrency, transactions, etc.) The "synchronized blocks" issue that you mentioned doesn't exist at all in Coherence (and I'm talking about dozens of SMP servers under serious load with hundreds of threads each).

    If you want any predictability of scale, better scalable performance, higher availability and amazing cost savings to boot, you can't beat clustered caching.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  27. db-is-shared

    Cedric -

    Ok, yes. You are correct :) I will fix the article... thanks!

    Dion
  28. DB Phobia Syndrome?

    As an architect who is deeply involved with both database and J2EE technologies, I find it perplexing that a large part of the J2EE community seems to be suffering from a "DB phobia syndrome", i.e. they choose to ignore or plainly refuse to consider some good and mature database technologies as part of a solution.

    Caching is a case in point. I'm sure most architects know that data caching is available in most if not all commercial databases. However, most J2EE-oriented architects choose not to consider it at all. I quote from the article: "If possible, I try to stay away from having to cluster the database machines". My question is: why not? I sure hope it's not because this solution is not in the J2EE domain.

    Another DB phobia phenomenon I have observed in the community is the use of SQL. Many in the community seem convinced that smart developers who are well versed in complicated OO technologies cannot master the skill of writing good, efficient SQL.

    I'm excited at how far J2EE has come along in the last 2-3 years as a standardized middle-tier platform, but with regard to data access and databases, I cannot help wondering if the technology as a whole has regressed.
  29. DB Phobia Syndrome?

    > As an architect who is deeply involved with both database and J2EE technologies, I find it perplexing that a large part of the J2EE community seems to be suffering from a "DB phobia syndrome", i.e. they choose to ignore or plainly refuse to consider some good and mature database technologies as part of a solution.

    Amen! One of the other strange things you keep seeing once in a while is huge caches being engineered to cache lookups that would take a couple of milliseconds, against databases that often reside on the same boxes. People then tell you that this "boosts performance". Another frightening thing I keep seeing in a lot of delivered code is absolutely inefficient queries to the database: running ten queries where, more often than not, one could have done the job with very minor changes to the application architecture, and then the desire to "boost performance with caching".

    That said, I agree that caching can have some very valid points. For example, if you choose to use an object-relational product, I'd always prefer one that has (distributed) caching built right into it.
  30. Amen

    "One of the other strange things you keep seeing once a while is huge caches being engineered to cache lookups that would take a couple of milliseconds and using databases that often reside on the same boxes."

    Actually, from my experience, there is a time and place for caches too. Especially when the database is not on the same box.

    Of course, as always, "premature optimization is the root of all evil." I.e., don't jump into coding a cache until you can prove that you need one.

    And, since we're on the subject of database design, a cache is most definitely no substitute for competent coding. Entirely too often, just rewriting a bit of code to use a single SQL select instead of 10 would boost performance far more than any cache will. I've actually been on projects where merely tuning the prefetch size or batching commands provided a huge performance boost, but a cache wouldn't have done any good at all.
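
    To make that concrete, the two knobs in question in plain JDBC (standard API; the table, columns and class are invented for the example):

        import java.math.BigDecimal;
        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;

        public class TuningExample {
            // 1. Fetch size: pull result rows from the server in bigger chunks.
            public void readItems(Connection con) throws Exception {
                PreparedStatement ps = con.prepareStatement("SELECT id, name FROM item");
                ps.setFetchSize(100); // fewer round trips for large result sets
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    // process rs.getLong("id") and rs.getString("name")
                }
                rs.close();
                ps.close();
            }

            // 2. Batching: send many updates in one round trip.
            public void repriceItems(Connection con, long[] ids, BigDecimal[] prices)
                    throws Exception {
                PreparedStatement ps =
                    con.prepareStatement("UPDATE item SET price = ? WHERE id = ?");
                for (int i = 0; i < ids.length; i++) {
                    ps.setBigDecimal(1, prices[i]);
                    ps.setLong(2, ids[i]);
                    ps.addBatch();
                }
                ps.executeBatch();
                ps.close();
            }
        }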

    I've seen entirely too many cases of clueless design which starts with the class design and treats the database design at best as an afterthought. So you end up, for example, with a design which is very nicely modular and well separated into separate DAOs for the client, the address, the invoice, the product object, and a few others...

    ... except it runs like crap at the end, because (as you've said) it ends up generating a flurry of queries: one for the client, one for the address, one for the invoice, one for each and every single item on it, and so on. (_Really_ stupid design will go even further and split this up even more: for each item you don't have just one query, but several, e.g. retrieve the item name in one query, the manufacturer name in another, and the price in a third.) When it would have been possible to retrieve all that in one, or at most two, queries.
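
    For illustration, the "one or at most two queries" version of that invoice page could be a single join - the schema here is invented for the example:

        public class InvoiceQuery {
            // One round trip instead of one query per object.
            static final String SQL =
                "SELECT c.name, a.street, i.invoice_no, " +
                "       li.item_name, li.manufacturer, li.price " +
                "  FROM client c " +
                "  JOIN address a    ON a.client_id   = c.id " +
                "  JOIN invoice i    ON i.client_id   = c.id " +
                "  JOIN line_item li ON li.invoice_id = i.id " +
                " WHERE i.invoice_no = ?";
        }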

    Here's an idea for these types: since most of the delay the user sees is spent in the database... why not start your design with _that_? As opposed to going straight into fantasy land with some (good-looking on paper) class design that abuses and misuses the database's capabilities?

    It's just about as stupid as starting to design a car by ignoring the need for an engine until the last moment. Sure, now we have a very nice car design. Too bad it can't go faster than 5 miles per hour, eh? Now let's design a cache for it.

    Bonus points are awarded for not considering the database design _at_ _all_, not even at the end. Just throw that class design, which is already utterly unfit for the database, into some JDO or EJB mapping tool and let it generate the database part. Then wonder why it runs slower than a snail. Well, duh: no matter how sophisticated that mapping tool may be, it can't turn lead into gold. A design which was utterly and horribly unfit for what the database will do will still be utterly unfit after being run through such a tool.

    Double bonus points go for buying into the XML hype and using it instead of a database. They have a perfectly good relational database, yet they store whole XML files as LOBs in it, effectively misusing the database as a funky file server. Then, instead of being able to select the rows they need with a simple select, they have to read and parse thousands of those LOBs and throw away the ones that don't match the criteria. Brain damage at its finest, but 100% buzzword-compliant.
  31. DB Phobia Syndrome?

    "a large part of the J2EE community seems to be suffering from a "DB Phobia syndrome" i.e. choose to ignore or plain refuse to consider some good and mature database technologies"

    I completely agree. There are so many people out there with the attitude that the database is some kind of external data dump, not really a concern of the developer.

    Another phenomenon similar to that is developers who don't build their own application: they write ejbs and somebody else builds the whole thing and deploys it for them.

    It makes it very hard to 1) tune and 2) debug an application when each developer is locked into a single tier.
  32. DB Phobia Syndrome?

    > As an architect who is deeply involved with both database and J2EE technologies, I find it perplexing that a large part of the J2EE community seems to be suffering from a "DB phobia syndrome", i.e. they choose to ignore or plainly refuse to consider some good and mature database technologies as part of a solution.

    >
     
     Some of us prefer to store garbage like forum messages in a transactional RDBMS; it is safer, is it not?

    > Caching is a case in point. I'm sure most architects know that data caching is available in most if not all commercial databases. However, most J2EE-oriented architects choose not to consider it at all. I quote from the article: "If possible, I try to stay away from having to cluster the database machines". My question is: why not? I sure hope it's not because this solution is not in the J2EE domain.
    >
    > Another DB phobia phenomenon I have observed in the community is the use of SQL. Many in the community seem convinced that smart developers who are well versed in complicated OO technologies cannot master the skill of writing good, efficient SQL.
    >

     SQL is legacy; entity beans and big new application servers are the solution for all problems. Everybody knows it, read the books and articles on the 'net; we need N-tier architectures.

    > I'm excited at how far J2EE has come along in the last 2-3 years as a standardized middle-tier platform, but with regard to data access and databases, I cannot help wondering if the technology as a whole has regressed.
  33. DB Phobia Syndrome?

     SQL is legacy; entity beans and big new application servers are the solution for all problems. Everybody knows it, read the books and articles on the 'net; we need N-tier architectures.


    Man, you're definitely wrong. Of course OR mapping solutions have proved their value in reality, but they are not applicable everywhere. I think you were joking.
  34. DB Phobia Syndrome?

     SQL is legacy; entity beans and big new application servers are the solution for all problems. Everybody knows it, read the books and articles on the 'net; we need N-tier architectures.

    >
    > Man, you're definitely wrong. Of course OR mapping solutions have proved their value in reality, but they are not applicable everywhere. I think you were joking.

    Looks like my message is very realistic on this site :)

    I do not know the best way to do data access, but plain old SQL and an RDBMS is a very good way for OLTP. But I am sure an RDBMS optimized for transactions is not the best way to store and search forum messages; it must be possible to implement a forum with this kind of features using plain Lucene index files and run it on a single PC.
  35. DB Phobia Syndrome?

    This is the story of the six wise men and the elephant. The elephant in this case is the data. The joke is that everyone is arguing about what the subject is, but no one is looking at the same part of the beast.

    The true architect is one who knows the appropriate technology for each data situation, and how to deploy it in a way that is scalable and extensible.

    > > >  SQL is legacy; entity beans and big new application servers are the solution for all problems. Everybody knows it, read the books and articles on the 'net; we need N-tier architectures.
    > >
    > > Man, you're definitely wrong. Of course OR mapping solutions have proved their value in reality, but they are not applicable everywhere. I think you were joking.

    Whoever said SQL is a legacy technology is showing their ignorance. SQL is the de facto database language and will be around longer than we will. SQL is still being improved and extended, and RDBMS servers are getting faster and better all the time.

    >
    > Looks like my message is very realistic on this site :)
    >
    > I do not know the best way to do data access, but plain old SQL and an RDBMS is a very good way for OLTP. But I am sure an RDBMS optimized for transactions is not the best way to store and search forum messages,

    That's right: if your data is "transactional" instead of "seldom updated, frequently read", you should go straight to the data server, whether it's through an entity bean or a passthrough method that calls a parameterized stored procedure. If the data is seldom or never updated (e.g., a list of cities and their zip codes), then go ahead and cache it locally.

    > It must be possible to implement a forum with this kind of features using plain Lucene index files and run it on a single PC.

    There is a bit of truth in that: most RDBMSs are horrible text management systems. They are great at two things: queries that involve tabular data (under 2K per record), and transactions.

    Go to Wall Street and ask them what they use to handle their transactions. The big surprise is: Sybase Adaptive Server (or Oracle or MS SQL Server, the first cousin of Sybase).

    The truth is there are different servers for different needs. Anyone who tries to do it all with just an RDBMS, or just app servers, is naive.
  36. DB Phobia Syndrome?

    That's right: if your data is "transactional" instead of "seldom updated, frequently read", you should go straight to the data server, whether it's through an entity bean or a passthrough method that calls a parameterized stored procedure. If the data is seldom or never updated (e.g., a list of cities and their zip codes), then go ahead and cache it locally.

    >


    Doug,

    Can't this be described as: if you don't really need the D in ACID?
    Because I have been in situations where the database was used to keep temporary data, but everybody felt that they still needed transactions.
  37. DB Phobia Syndrome?

    I guess that depends on what you mean. I think you are asking whether "distributed cached" data could be stored on the application server without going to the database server. That would entail a way of ensuring that all reads of that data must be consistent and "serializable".

    As I understand it, ACID is:

    Atomicity - All changes take place or none at all (rollback)
    Consistency - Transactions run on a consistent view of the data
    Isolation - Transactions run in isolation from each other
    Durability - Committed transactions stay committed in case of a system crash

    If you give up durability, you might not be able to recover committed transactions, which is no good. I would instead say that you could run at a lower level of isolation (e.g., dirty reads), keeping "lookup" data in the app server.
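
    In JDBC terms, reading lookup data at a lower isolation level is a one-liner (standard java.sql API; the DataSource is assumed to be configured elsewhere):

        import java.sql.Connection;
        import javax.sql.DataSource;

        public class LookupReader {
            // Dirty reads are acceptable here because lookup data seldom changes.
            public Connection openLookupConnection(DataSource ds) throws Exception {
                Connection con = ds.getConnection();
                con.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
                return con;
            }
        }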

    As for truly "temp" data, it's usually private to the transaction itself, so it doesn't matter if it goes away in the middle of a transaction, right? As long as the transaction itself is backed out of the database.

    I guess the point I would venture to make here is that certain data types used in transactions lend themselves to being stored in a single place: the database server. Distributing and caching them is asking for trouble. But for other types of data which are seldom updated and frequently read (such as email threads), it might make sense to use one of these distributed caches.
  38. DB Phobia Syndrome?

     SQL is legacy; entity beans and big new application servers are the solution for all problems. Everybody knows it, read the books and articles on the 'net; we need N-tier architectures.


    "There is no silver bullet".
       [Frederick P. Brooks, JR.]
  39. Don't forget the DB :)

    Harold -

    The database is a very important tier in development. I hope you don't think that this article is trying to say "don't use the DB features", as that is *not* the case.

    I do not believe that the DB is just a data container. I do think that some developers are scared of nice features like triggers, stored procedures, etc. It often does make sense to use these features, and we shouldn't be scared to do more in the database for performance (we can abstract these behind DAOs too, so the business objects don't even know).

    However, even with all of these tricks, I have found that using distributed caches is a great way to scale the data tier (above and beyond scaling the DB itself). Many applications DO NOT NEED TO SCALE like this... but large enterprise systems often DO.

    Dion
  40. Don't forget the DB :)

    I think the point is to use application-layer cache strategies as a way to avoid unnecessary *reading* of seldom-updated data (mostly lookup info). This could have the advantage of freeing the database server to handle transactions: something relational database servers do well. It's hard to imagine a high-throughput transaction system that did it all in the cache (at least not if it expects to be able to recover from a crash).

    What most people don't realize is that the relational database server has a very sophisticated and optimized transaction logging system, capable of handling thousands of transactions per second (tps). (See www.tpc.org for proof.) Since ultimately changes to data have to be recorded in the database, there will be some overhead, and transactions have to wait until a log record can be written. The best backend scenarios for scaling involve multi-processor database servers and RAID 0+1.

    I would like to know how the application-tier entity bean cache (Tangosol Coherence) does its locking and guarantees degree-three consistency (repeatable reads) or serializable transactions?

    By the way, this is an interesting discussion. I think it's important to realize that in the data modeling phase (if people are still doing it these days) you must classify the durability and recovery characteristics of your data elements. This of course means that you need to review your transactions carefully. One isolation level does not fit all needs.

    Thanks,

    Caveat: I am an RDBMS expert using Sybase.
  41. DB Phobia Syndrome?

    Harold: Caching is a case in point. I'm sure most architects know that data caching is available in most if not all commercial databases. However, most J2EE-oriented architects choose not to consider it at all. I quote from the article: "If possible, I try to stay away from having to cluster the database machines". My question is: why not? I sure hope it's not because this solution is not in the J2EE domain.

    Actually, it's a lot simpler than that. When you have two J2EE app servers and one database, maybe the database can keep up no problem. But if you add another 20 J2EE app servers, or another 50, how's that database doing? The database quickly becomes a single point of bottleneck. Even when it is not a bottleneck, a cache can be orders of magnitude faster, and keeps the database freed up for important work, like transactions. Further, the database becomes a single point of failure, and guarding against that (and still having it scaled up) becomes ridiculously expensive. On an average "high end" J2EE system, I'd suggest that at least 50 cents out of the infrastructure dollar (and probably a lot more than that) goes to the database infrastructure, and these are the companies buying those mythical $90k J2EE licenses.

    So generally speaking, when people talk about letting the database do it all, they aren't talking about high-scale systems. At least not data-intensive ones. I'll ask for some numbers from the customer I'm visiting today, but IIRC they dropped their page time from multiple seconds to under 20 milliseconds by using caching instead of going direct to a monster SQL database. And that was before the load was applied ;-). So their application became both more performant (faster response times) and more scalable (supports more users with the same amount of hardware) by using caching.

    So, it's not "db phobia", it's simply wise planning to not send every car in town through the same rotary, regardless of how big and expensive the rotary is.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  42. DB Phobia Syndrome?

    "But if you add another 20 J2EE app servers, or another 50, how's that database doing?"

    I think one should be able to run a 20-node database cluster. If your application can run on twenty nodes it means there is very little data sharing.
  43. DB Phobia Syndrome?

    Explain to me why I should use a cluster of 20 Oracle servers at a $1 million+ price tag when I can use a cluster of 20 JBoss servers, 1 Oracle server, and caching for a tenth of that cost?

    All too often in enterprise deployments, the database license is the biggest cost. Buying a $10k license for Coherence vs. a $100k license for more Oracle nodes just makes sense.


    > I think one should be able to run a 20-node database cluster. If your application can run on twenty nodes it means there is very little data sharing.
  44. DB Phobia Syndrome?

    Explain to me why I should use a cluster of 20 Oracle servers at a $1 million+ price tag when I can use a cluster of 20 JBoss servers, 1 Oracle server, and caching for a tenth of that cost?


    You may not need to. It all depends on whether you feel paranoid enough about your application's transactional behavior that you don't feel comfortable improvising on it. I would say that for an online bookstore, having a specific isolation level is probably not that important. What does your application do?
      
    > All too often in enterprise deployments, the database license is the biggest cost. Buying a $10k license for Coherence vs. a $100k license for more Oracle nodes just makes sense.

    I think the biggest cost is probably development, not the database license, at least over the life of the system. (I guess it depends on the system; I am not really thinking about message boards here.)

    I think with a database cluster you get some benefits that you don't get with a middle tier cache. But not everyone may need them. You know what your application does, so I am not suggesting that you stop doing what you are doing, I am just saying that if you can afford a database cluster, then it's a cleaner solution. The license cost will be much higher, and you will spend a bit more on hardware, but it may save you money in other areas, including development costs.
  45. DB Phobia Syndrome?

    I think one should be able to run a 20-node database cluster. If your application can run on twenty nodes it means there is very little data sharing.


    To run a 20-node DB cluster efficiently you need to partition your database. There is no database that can cluster 20 nodes efficiently in a shared-disk cluster.
    AFAIK, about 8 nodes is a practical limit; above that, the synchronisation overhead is too high.

    Mileta
  46. DB Phobia Syndrome?

    To run a 20-node DB cluster efficiently you need to partition your database. There is no database that can cluster 20 nodes efficiently in a shared-disk cluster.

    > AFAIK, about 8 nodes is a practical limit; above that, the synchronisation overhead is too high.

    I assume that you know what you are talking about, but if you can expand I'd love to hear more details. Synchronization of what? Since you are talking about shared disk (we were talking about caching), I assume that you are talking about the RAID array becoming a bottleneck?

    As far as network communications go, if the workload has a lot of share data then Oracle RAC will perform *much* better than an object cache, simply because of the different technologies involved (UDP vs. RDMA).

    Cameron was talking about 20 nodes. Obviously an application that runs on 20 nodes has little data sharing, or pays the price for it (which may be all right, depending on the app). Personally, I don't think it's a good idea to run 20 or 50 nodes. If you do, then maybe the machines are too small and could be replaced with slightly larger SMPs for simpler administration. I have only known one system which ran 40 nodes, and that was Barnes & Noble's web site. They ran 40 NT machines (this was 5 years ago, I think).
  47. .

    "they dropped their page time from multiple seconds to under 20 milliseconds by using caching instead of going direct to a monster SQL database. "

    Wow, that sounds like a poor query. But I get your point and agree with it - you can't compare a read from memory with round-tripping to a database, executing a query, and returning a result set; there's just no comparison. We're currently using the technique of putting a cache object on the JNDI tree for situations where the data is almost always read-only, and that seems to be working really well, but it's not transactional of course.
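
    That technique in miniature, for the curious (plain JNDI; the name "cache/products" and the loader are invented, and in a cluster each node sees its own copy unless the JNDI tree is replicated):

        import java.util.Collections;
        import java.util.HashMap;
        import java.util.Map;
        import javax.naming.InitialContext;

        public class JndiCacheExample {
            // At startup: build the read-mostly map and bind it into JNDI.
            public void bindCache() throws Exception {
                Map products = Collections.synchronizedMap(new HashMap());
                products.put("sku-1", loadProduct("sku-1")); // invented loader
                new InitialContext().rebind("cache/products", products);
            }

            // Anywhere else in the app: look it up and read from memory.
            public Object getProduct(String sku) throws Exception {
                Map cache = (Map) new InitialContext().lookup("cache/products");
                return cache.get(sku);
            }

            private Object loadProduct(String sku) {
                return new Object(); // stand-in for the real DB read
            }
        }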
  48. performance diffs

    Tracy: Wow, that sounds like a poor query. But I get your point and agree with it - you can't compare a read from memory with round-tripping to a database, executing a query, and returning a result set; there's just no comparison. We're currently using the technique of putting a cache object on the JNDI tree for situations where the data is almost always read-only, and that seems to be working really well, but it's not transactional of course.

    I highly doubt it ;-) ... not with the countless millions that these companies spend on their databases, Oracle profiling tools, well-trained DBAs, fiber networks, massive clustered SMP database servers, etc.

    Look, even in a best-case scenario, it takes time to talk to a database. You have to communicate across the wire. Typically, the communication is SQL, which has to get parsed, or at least has to obtain a latch on a parse cache or a prep'd statement cache. Then the database has to do its work, even if pulling from cache. Then the answer has to come back. Oracle is extremely effective at doing this stuff, yet it still can't cut its real world latencies below 10ms, and a lot higher if the data isn't in cache, and way higher for 2PC, or if you have an external tx manager, etc.

    In a "typical" scenario, the database server is handling thousands (or more) of requests a second. It's busy. Average "read only" latencies rise into the hundreds of milliseconds. The caches in Oracle get so busy that your requests line up in spin locks waiting to access the caches. Why? Because developers use the database, and they don't notice when they are testing with "1 virtual user" (themselves) that its inefficient to utilize the database for everything. Change it to 10,000 (or more) real users, and you see a disturbing picture.

    It's not a question of clustering Oracle or not. Companies like their data to always be available. They will cluster Oracle. They'll run redundant networks. They'll have multiple data centers with failover. They'll spend tens of millions just planning it and setting it up. But they won't want to drag their tens-of-millions-of-dollars 99.999%+ HA database network to its knees putting together HTML web pages. They'll cache at the content end ... look at Vignette for an example: it becomes a dog if its cache hit rate drops below 99%. They'll cache in their web farms. They'll cache in their JSP tier. They'll cache in their Servlet container, and in their EJB tier. And of course they'll cache in the database. And with all that caching, they can achieve some serious scalability. And I've never heard one of them say "let's turn off all the caching except the database caching because databases cache well." That would be suicide.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  49. Times

    "Look, even in a best-case scenario, it takes time to talk to a database. "

    You won't get an argument from me about caching; I love it and do it all the time. I typically use the technique of putting a JNDI object on the tree for mostly-read-only data.

    And you're right, it does take time, and that time depends on the network. I recently did some tests: to execute a query that basically did nothing, it took 16ms to go from WLS 8.1 out to Oracle, execute a minimal query, and return a result set. That time was very repeatable. Of course, a HashMap lookup probably couldn't even be measured in ms.
  50. performance diffs

    Look, even in a best-case scenario, it takes time to talk to a database. You have to communicate across the wire. Typically, the communication is SQL, which has to get parsed, or at least has to obtain a latch on a parse cache or a prep'd statement cache. Then the database has to do its work, even if pulling from cache. Then the answer has to come back. Oracle is extremely effective at doing this stuff, yet it still can't cut its real world latencies below 10ms, and a lot higher if the data isn't in cache, and way higher for 2PC, or if you have an external tx manager, etc.


    1) If the data isn't in the cache then it won't be in the object cache either.
    2) If an XA transaction is involved the overhead will be present for the object cache as well.

    I am not too worried about the latency. The reason is that I am thinking about the case where you are pushing your system far into the latency-throughput curve. *Any* system, at its maximum throughput, will have higher latency. So you'll end up with high latency anyway. If anything, the disadvantage is that you'll get lower throughput because of greater CPU utilization, but then you just buy one more cpu per node and you are probably back where you started.

    > In a "typical" scenario, the database server is handling thousands (or more) of requests a second. It's busy. Average "read only" latencies rise into the hundreds of milliseconds. The caches in Oracle get so busy that your requests line up in spin locks waiting to access the caches. Why? Because developers use the database, and they don't notice when they are testing with "1 virtual user" (themselves) that its inefficient to utilize the database for everything. Change it to 10,000 (or more) real users, and you see a disturbing picture.

    I still think that using RDMA you can scale way beyond the regular single-node database. The picture is no longer disturbing. In addition you may be able to save drastically on hardware by buying several small machines rather than one big one.

    Pfister says in his book that the communication overhead for DB2 on a sysplex with a workload where the data is *fully* shared is about 15%. That means the scalability is very nearly linear. It might be worth a thought.
  51. DB Phobia Syndrome?

    > As an architect who is deeply involved with both database and J2EE technologies, I find it perplexing that a large part of the J2EE community seems to be suffering from a "DB phobia syndrome", i.e. they choose to ignore or plainly refuse to consider some good and mature database technologies as part of a solution.
    >
    > Caching is a case in point. I'm sure most architects know that data caching is available in most if not all commercial databases. However, most J2EE-oriented architects choose not to consider it at all. I quote from the article: "If possible, I try to stay away from having to cluster the database machines". My question is: why not? I sure hope it's not because this solution is not in the J2EE domain.
    >
    > Another DB phobia phenomenon I have observed in the community is the use of SQL. Many in the community seem convinced that smart developers who are well versed in complicated OO technologies cannot master the skill of writing good, efficient SQL.

    Amen and hallelujah! I've been in the industry 20 years and recently (4 years ago) retooled to Java and J2EE. I am amazed beyond belief how little the client/server web-app crowd knows about good online transactional, pseudo-conversational code and database practices. It's as though the industry had to rediscover what we were doing 20 years ago with COBOL and CICS containers. Witness the publishing of books like Bitter EJB and Rod Johnson's best-practices book for J2EE - both good books, but for crying out loud, what an indictment of the industry! What the hell were these people doing in college that they ended up writing such horrible code? And why didn't the profs at these schools equip them with these skills?

    Clustering a database is a superb way to protect for failover, and frankly, most good databases have had good caching strategies in place for decades. I'm demanding it be done for our currently deployed web application. Frankly, the developers of databases are a lot smarter at writing caching strategies than most other developers around, and if most J2EE folks would get their egos out of the way, they could see the same.

    I agree any competent developer should have good SQL skills as well - and some databases have some very powerful and useful extensions. To not use them appropriately is... poor customer service.
  52. What about this?

    http://www.alzato.com/ndb.html

    "NDB Cluster is a parallel main-memory
    database server product that enables
    applications to have real-time access to
    data in client-server mode, while guaranteeing
    continuous availability. For the
    developer, the NDB Cluster provides a
    relational database view with location
    and replication transparency."

    Can we consider this a relational interface (instead of a Map) to a clustered cache?

    PS: I am not biased. Alzato was bought by the MySQL group.
  53. Thanks for the nicely-written article, which helps people to be aware of the issues in developing a caching solution, either in-house or using 3rd-party products.

    Does the author (or anyone else) have some experience to share regarding the following issues related to caching?
    1. Java object contention: if a java.util.Map-like construct is used for the storage of the cached objects, what are the considerations around contention on the java.util.Map implementation object? In a read-mostly scenario, the java.util.Map object will be updated occasionally. However, to ensure the integrity of the object, a (straightforward) cache implementation will need to synchronize the read operations as well. When the read operations are synchronized, what is the impact of that synchronization? Or is there another strategy to avoid synchronization on read operations?

    2. Lists (rather than lookups): the Map interface is natural (and can be optimized) for lookups but not for listing. An example could be listing the last 50 active threads of a forum, rather than retrieving the details of a specific thread (or message).
    If items are put into a java.util.Map-like cache, how does the application implement list-type functionality? Will list-type functionality be implemented with a direct database call?
  54. What about object contention and list?

    I cannot answer for other products, so the following answers apply to Tangosol Coherence specifically:

    > 1. Java object contention: if a java.util.Map-like construct is used for the storage of the cached objects, what are the considerations around contention on the java.util.Map implementation object? In a read-mostly scenario, the java.util.Map object will be updated occasionally. However, to ensure the integrity of the object, a (straightforward) cache implementation will need to synchronize the read operations as well. When the read operations are synchronized, what is the impact of that synchronization? Or is there another strategy to avoid synchronization on read operations?
    >

    Coherence implements a very efficient distributed and scalable concurrency API for its caches to deal with this very concern. This allows you to synchronize read-for-update operations across the cluster easily and make sure that you always get the correct values when it counts (assuming all your logic follows the same logical access pattern, namely: lock/read/update/unlock).
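
    In code, that access pattern looks roughly like this - sketched from memory of the Coherence API of that era, so treat the signatures as approximate; the cache name and the Account class are invented:

        import com.tangosol.net.CacheFactory;
        import com.tangosol.net.NamedCache;

        public class LockReadUpdateUnlock {
            // Minimal stand-in for the real domain class.
            public static class Account implements java.io.Serializable {
                private double balance;
                public void debit(double amount) { balance -= amount; }
            }

            public void debit(Object accountId, double amount) {
                NamedCache cache = CacheFactory.getCache("accounts");
                cache.lock(accountId, -1);  // -1 = wait for the cluster-wide lock
                try {
                    Account acct = (Account) cache.get(accountId);  // read
                    acct.debit(amount);                             // update
                    cache.put(accountId, acct);  // write back, visible cluster-wide
                } finally {
                    cache.unlock(accountId);
                }
            }
        }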

    > 2. Lists (rather than lookups): the Map interface is natural (and can be optimized) for lookups but not for listing. An example could be listing the last 50 active threads of a forum, rather than retrieving the details of a specific thread (or message).
    > If items are put into a java.util.Map-like cache, how does the application implement list-type functionality? Will list-type functionality be implemented with a direct database call?
    >

    For situations where you need to search through the data loaded into the cache, Coherence provides a distributed query API, which allows you to easily and efficiently, using all the distributed computing resources available to the cluster, find cached objects by their attributes. So in the example you give, as long as there is an attribute for thread activity (let's say "ActivityScore") on the thread, you can execute a distributed query returning "activity scores in a certain range, in descending order, limiting the number of objects returned to 50". You can also create indexes on the most frequently used attributes to speed up your queries.

    While this is not a replacement for the database functionality by any stretch, it does greatly simplify cache usage and provides excellent, scalable capability for group access to cached data.
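
    A rough sketch of such a query - again from memory of the filter API, so check the product docs; the cache name and the getActivityScore accessor are invented:

        import java.util.Set;
        import com.tangosol.net.CacheFactory;
        import com.tangosol.net.NamedCache;
        import com.tangosol.util.extractor.ReflectionExtractor;
        import com.tangosol.util.filter.GreaterFilter;
        import com.tangosol.util.filter.LimitFilter;

        public class ActiveThreadQuery {
            public Set findHotThreads() {
                NamedCache threads = CacheFactory.getCache("forum-threads");
                // Index the attribute so the query runs against the index.
                threads.addIndex(new ReflectionExtractor("getActivityScore"), true, null);
                // Threads with a score above 100, paged 50 at a time.
                return threads.entrySet(new LimitFilter(
                    new GreaterFilter("getActivityScore", new Integer(100)), 50));
            }
        }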

    Best Regards,
    Alex Gleyzer
    Tangosol, Inc.
    Coherence: Cluster your Work. Work your Cluster.
  55. It does sound like you are reinventing the wheel that RDBMS vendors built decades ago.
  56. What about CMP EJBs?

    I see how nicely a caching solution like this can fit over BMPs. We have an application that intensively uses CMP entity beans, and as the number of users for this application increased, the more problems we had. It is an intranet application which in 2 years has grown to support more than 10,000 users, and the more users came in, the more problems we had. We run JBoss, and we went through a painful process of identifying the most-used entity beans. Those were the ones that caused deadlocks and sometimes even made the AppServer exit completely (when the critical mass of concurrent users was reached).
    Using CMP has certain advantages, one of which is productivity, but it can lead to performance problems we may not consider at the beginning.
    My conclusion is this: if you do not design for clustering, caching and high availability in the first place (we didn't), there are some good practices that will only do good:
    Using value objects from the very beginning will certainly prove effective. Also, the use of a delegation layer on top of the EJB layer leads to a single point of access for the value objects. I'm not saying this to look smart: I had to refactor the whole application and replace hundreds of direct calls to the bean layer with calls to the delegation layer.
    We understood only after a painful exercise the advantages of this layer: not only does it ease the whole refactoring process during later development and maintenance stages, but it also makes it easy to add a caching mechanism later on by looking mostly at the delegation layer.


    My 2cents
    florin
  57. What about CMP EJBs?

    Florin,

    > We have an application that intensively uses CMP entity beans, and
    > as the number of users for this application increased, the more
    > problems we had.

    > .. Those were the ones that caused deadlocks and sometimes even made
    > the AppServer exit completely

    Well, did you try another AppServer? Too bad you blame CMP for what actually looks like a performance, scalability and stability issue in the product you were using.

    I've often said this in the past - but some (many? ;) developers seem to bad-mouth and blame the technology for what is essentially a shortcoming in a product.

    -krish
  58. If you've had a CMP project that actually scaled beyond 5 concurrent users, chances are it wasn't using much in the way of container-managed _relationships_. I.e., it probably wasn't an enterprise problem to start with.

    Something like a bulletin board for a niche site is easy to do well in CMP. (Well, or rather it is still wasted money, brains and time, since a plain old JDBC solution is still easier to code, faster to execute, and easier to maintain. But at least it's not _too_ much waste to get an extra buzzword.)

    "Benchmarks" which just read 10,000 records sequentially are even easier to get right in CMP. If all you do is a "select * from benchmark_table", yeah, the container can't botch it _too_ much for you. Which, I suppose, is why they're everyone favourite "proof" that CMP scales and is OK.

    Now try a complex query, where you have to retrieve a whole tree, via those relationships. Eep. Doesn't scale that well any more, does it?

    And let me explain why it'll _never_ be on par with doing the same select directly over JDBC. You see:

    1. Reading and discarding records is cheap when you do it directly in the database server. Even the dreaded full table scan is essentially limited only by the disk bandwidth and seek time. On the other hand, the app server has to do round trips over the network to get the same data. Unless your data is already in the cache (and you'll never have a 40-million-record database completely in the cache), the delays in reading those records over the network can quickly add up to be crippling.

    But maybe the app server is smart enough to be doing the exact same queries for you as you would do over JDBC, so you only get the caching and no penalty? Not so:

    2. The very way the Entity Bean spec and cache are defined conspire against that possibility. It works on whole records.

    Let's say you really need a query like "select A.foo, B.bar, C.baz from A, B, C, D where A.field1=B.field1 and B.field2=C.field2 and C.field3=D.field3". Easy enough, right? Do you think that the container can possibly just run something like that on the DB server for you, and just conveniently cache the result? Think again.

    There's no way to cache whole records while retrieving only pieces of them. (That is, assuming you could even ask for only the fields, and not the whole objects, anyway.) I.e., the container must literally retrieve whole records from A, B, C, and D, in separate queries. And _then_ figure out how to combine and filter them, duplicating functionality that the database server already has. Only, again, it's doing it over queries and over a network connection, instead of by directly accessing the local disk (as the DB server does).
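
    By contrast, the plain JDBC version of that exact query pulls just the three columns in one round trip (column types assumed for illustration):

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;

        public class ThreeColumnQuery {
            public void run(Connection con) throws Exception {
                PreparedStatement ps = con.prepareStatement(
                    "select A.foo, B.bar, C.baz from A, B, C, D " +
                    "where A.field1 = B.field1 and B.field2 = C.field2 " +
                    "and C.field3 = D.field3");
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    String foo = rs.getString(1);  // only the pieces we asked for,
                    String bar = rs.getString(2);  // never the whole records
                    String baz = rs.getString(3);
                }
                rs.close();
                ps.close();
            }
        }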

    3. Incidentally, for the very same reason, it makes poorer use of the available RAM. A database server can do a lot of funky things, such as caching only the query results, or the relevant indexes or whatever. It does not have to masquerade as a cache of _objects_.

    4. A database is easy to tune by a competent (or even semi-competent) DBA. Stuff like partitioning, views, indexes, hints, etc., can be optimized on the fly very easily. Doing the same on an application server is generally not even possible, and nowhere near that easy.

    Which brings us all the way to the conclusion:

    5. Database technology is a mature, proven technology. Oracle and IBM have _decades_ of experience with making a good, responsive database server. They have mountains of RDBMS code which is refined and improved with every release. Why not make use of that experience, instead of trying to avoid it?

    Whereas entity beans are just the latest software hype, and a nice buzzword to have on the resume, but they're really an exercise in what is an anti-pattern: reinventing a less powerful, less stable and less scalable version of a wheel which already existed. They're trying to emulate something which already exists, but in the end they're nowhere _near_ as performant or powerful as what the database server already offered. _Any_ application server I've seen is, by comparison to a real RDBMS, at best experimental code.

    Would you bet your client's money just to support some unneeded experiments? Well, in that case may I interest you in my patented Snake Oil (TM) Enterprise Edition screen saver? Only $10,000 per CPU. And, hey, you get the warm fuzzy feeling that it's not Microsoft's .NET.
  59. Cristian,

    Your argument seems to be centred around the notion that CMP is the proverbial silver bullet for persistence and data access. It's not, and probably never will be. The same applies to any persistence technology or framework around today. They're all good for certain tasks and not so good for others. And you'd want to select the right technology for the right kind of job and, more importantly, the right one for your problem domain. (When you have a hammer, every problem looks like a nail ;)

    But to answer some of your points: there is an easily available CMP/CMR project that has scaled well beyond 5 users with hard-set response times. It's the SPECjAppServer2002 benchmark, which primarily tests an AppServer's handling of EJB 2.0 with CMPs & CMRs. And in the last 4 weeks, I've been on customer visits running performance metrics on customers running Borland Enterprise Server (I work for Borland). At 3 out of the 4 customers I visited in the last month, CMPs & CMRs were being used extensively, and the least loaded application at one of these customers ran on a twin-CPU Solaris box (the AppServer) handling an average of 1200+ concurrent web users. And before you ask, yes, it was a sizeable application, with a database of ~7GB, around 300 EJBs (of which 250+ were CMP 2.0) and over 400 JSPs. And yes, there were stored procedures, triggers and direct JDBC to complement CMP.

    Of course it does not make sense to pull huge graphs of objects out of your database, process them in your business logic tier and then send across a processed subset to your client tier. Whether you do this with CMP/JDO/JDBC/Insert_Your_Acronym in the business logic tier does not matter. You shouldn't be doing it.

    And I'm all for processing at the database tier if it makes sense. You're not paying all those $'s to Microsoft/Oracle/IBM to get essentially a dumb data store. And I'm all for caching solutions. You just cannot achieve massive scalability without a well-thought out and implemented caching solution at different tiers/layers.

    All I meant to say in my previous posting was that when people conclude technology "X" sucks, or you never should be doing "Y", have either (i) Never used it, or (ii) Used it wrongly or in a way it was never intended to be used, or (iii) worked with a product with poor support, performance and scalability for "X" or "Y".

    -krish
  60. "Of course it does not make sense to pull huge graphs of objects out of your database, process them in your business logic tier and then send across a processed subset to your client tier. Whether you do this with CMP/JDO/JDBC/Insert_Your_Acronym in the business logic tier does not matter. You shouldn't be doing it."

    ...and end up with stored procs? What counts as huge? We always need 3-5 dynamic queries plus up to 10 static/read-only ones (which could be cached). Should we sacrifice maintainability for that? I think not.

    Can Borland AS prefetch for CMP in one join? WLS is the only server I know of that can do it.
  61. "Whether you do this with CMP/JDO/JDBC/Insert_Your_Acronym in the business logic tier does not matter. You shouldn't be doing it."


    You should have said "Whether you do this with CMP/JDO/JDBC/IYA (Insert Your Acronym) ..." ;-)
  62. When you say you still need a good DB design, and "And yes, there were stored procedures, triggers and direct JDBC to complement CMP", you're basically saying that you did your homework.

    My problem is precisely with people who don't. Those who, as you've put it, paid big bucks to Oracle or IBM just for a dumb store.

    They start with the view that "everything is an object" (no, in the end it will be a database record), that "all the world is Java" (nope, you still have SQL somewhere), and that "persistence isn't our concern; we'll design our nice class hierarchies completely ignoring whether the objects go into a database, or into serialized classes on disk, or into XML files" (nope, you don't need such generality in any actual project). Even if you don't see the above phrases explicitly written down as such, there are plenty of projects which are designed that way. When you see a project where they just skipped the mapping completely and serialized whole objects into a generic BLOB field (or wrote them as XML into a BLOB or CLOB), you tell me if it's not made with these assumptions in mind.

    Running a whole project through a JDO mapping product or some CMP generator wizard does not produce serialized BLOBs, but it's a symptom of the same "the database is not our problem" mentality. And it will still run like a dog. And just layering caches on top of a rotten design, instead of fixing the real problem, is just another symptom of the "all the world is Java" mentality.

    As I've said, there _is_ a time and a place for caches, and I've used caches to good effect myself. But I wish people didn't take it as a generic substitute for good design, and good use of what the database already offers. Basically: first make sure that you use the database right (e.g., that you don't generate a flurry of 20 queries, where one complex join would be enough), and only _then_ decide if you need a cache or not.
  63. That said, you make the perfect case for why CMP is a waste of money, brains and time anyway. In fact, you make the case better than I could. Let's consider what happens in the typical project.

    Joe Coder starts by mapping everything to CMP and CMR. This already takes more development time than just using plain old JDBC and storing the data in a data object yourself.

    Then he has to go through more loops, by:

    - writing more interface classes. For example the famous stateless session facade, because using the Entity Beans directly from the client is slower than a dead snail.

    - in the process, he typically ends up extracting the data from the CMP bean, packing it up in his own data object, and then sending that to the client. Or vice versa. What a waste of time. Couldn't he just read the data into that object via JDBC from the start? (See the sketch after this list.)

    - in the process, he also typically runs into a general problem of EJBs: complete lack of support for inheritance. If I extend a data object in the client, I can't just pass the subclass to the server. It won't be able to de-serialize it. So essentially we're back in time to using a dumb C style struct, instead of classes. (Having data _and_ an extensible behaviour that's intrinsic to the object is the whole idea behind OOP, you know. The moment you're forced to have pure data objects and pure processing classes -- e.g., stateless session bean facades -- you're back in time to C programming.) It also means more copying and/or wrappers and interface classes written in the client, to deal with this.

    - but wait, it still runs like a dog. Let's do more tricks, like duplicating some of the functionality in pure JDBC shortcuts. Oh great. Can you say, "code duplication"? Yep, I knew you could. And you probably know why that's an anti-pattern. Now you have both the JDBC _and_ the CMP/CMR versions of the same queries to maintain, and to keep in sync. Plus now you need to know two APIs well, whereas in the "old fashioned" approach just one API would have been enough.

    - hmm, it's better, but it still doesn't meet the expectations. Let's build our own cache layers. Wasn't the whole idea of CMP precisely to provide its own transactional cache, and spare me the effort? Once I'm back to doing this by hand, why use CMP in the first place?

    - not bad, but still lacking. Let's move some stuff into stored procedures after all. Now that typically results in changing tens to hundreds of files. (Including all those facades, and data objects, and deployment descriptors, and everything else.) Whereas the straight JDBC solution could well involve 10 times less effort to change.

    And so on, and so forth. I could go on for pages.
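
    For contrast, here is roughly what the "old fashioned" path from the second bullet above looks like (the table, column, and class names are made up): plain JDBC straight into the serializable object the client receives, with nothing to keep in sync.

    import java.io.Serializable;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // A plain serializable data object; the client can subclass it freely.
    class CustomerData implements Serializable {
        String name;
        String email;
    }

    class CustomerDao {
        // Straight JDBC into the object the client receives: no CMP bean
        // in the middle, no facade copy step, nothing to keep in sync.
        CustomerData load(DataSource ds, long id) throws SQLException {
            try (Connection con = ds.getConnection();
                 PreparedStatement ps = con.prepareStatement(
                     "select name, email from customer where id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;
                    CustomerData c = new CustomerData();
                    c.name = rs.getString("name");
                    c.email = rs.getString("email");
                    return c;
                }
            }
        }
    }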

    Basically, you may have seen the end result at those clients, but I'm thinking you haven't seen the massive waste of man-months that went into getting it there. Yes, the "scale beyond 5 clients" was a bit of a hyperbole. The fact that it takes more effort (and thus money) _and_ yields worse performance, is not.

    Now I've already come to terms with the idea that you can sacrifice performance, if you save development time and money. (E.g., why we're writing this stuff in Java, instead of assembly.) But sacrificing performance to end up paying _more_ development time and money is still a surrealistic concept to me. The great vision of the future is to end up... worse _and_ more expensive? If that's not being a fashion victim, I don't know what is.

    And one more word: I keep hearing arguments like yours, which run along the lines of "yeah, well, if it didn't work for you, maybe you didn't do it well, or not using the right product." You know what that tells me? It tells me that there's an inherent risk in choosing this technology. I've done direct JDBC to quite a few databases, including Oracle and DB2, and it never ended up with "but maybe you did it wrong", nor with "but maybe you have the wrong server." It just worked as intended. Plain and simple.
  64. Yes, Risk Matters[ Go to top ]

    And one more word: I keep hearing arguments like yours, which run along the lines of "yeah, well, if it didn't work for you, maybe you didn't do it well, or not using the right product." You know what that tells me? It tells me that there's an inherent risk in choosing this technology.


    I completely agree with this statement. Risk matters. It's generally better to choose a technology whose properties are predictable than a technology which fits the job to be done to a 'T', but you don't know how many weeks of optimization it will require. Furthermore, in many cases an architecture will look right in release 1 of a project, and look really stupid in every version after that, simply because important choices were made on the basis of circumstantial requirements which changed right away.
  65. Cristian: If you've had a CMP project that actually scaled beyond 5 concurrent users, chances are it wasn't using much in the way of container managed _relationships_. I.e., it probably wasn't an enterprise problem to start with.

    CMP can suck, but doesn't have to. WebLogic, for example, does a pretty good job with it, including with respect to relationships. You can't have a newbie designing the system, though ... (as I was when I built my first WebLogic CMP application, that brought a 4-CPU Oracle server to its knees with only half a dozen concurrent users!)

    Cristian: Something like a bulletin board for a niche site is easy to do well in CMP. (Well, or rather still wasted money, brains and time, since a plain old JDBC solution is still easier to code, faster to execute, and easier to maintain. But at least it's not _too_ much waste to get an extra buzzword.)

    You are allowing your personal opinions to cloud your judgement. Our customers often render fully dynamic pages faster than a JDBC driver can perform one simple query. As a result, it is impossible for a pure JDBC solution to be faster. (That said, check out Isocra and their caching JDBC driver. Cool stuff.)

    Cristian: "Benchmarks" which just read 10,000 records sequentially are even easier to get right in CMP. If all you do is a "select * from benchmark_table", yeah, the container can't botch it _too_ much for you. Which, I suppose, is why they're everyone's favourite "proof" that CMP scales and is OK.

    Wow. I am disagreeing with everything you are saying .. I promise I am not usually this disagreeable ;-)

    This is the worst kind of benchmark for CMP. You should never (IMHO) use entity EJB for anything like this. Talk about "every problem begins to look like a nail"!!! Processing (outside of the database) more than a few entity bean instances within a transaction (which EJB access typically represents) is not what entity EJBs were intended for, IMHO. They are transactional representatives of back end data. That's it. It's not about "data access", but about "data transactions". If you want data access, check out JDO (Solarmetric Kodo or Hemtech JDOGenie come to mind since I know them from our partnerships) and also Hibernate. Or use Spring or go at JDBC directly.

    Cristian: Now try a complex query, where you have to retrieve a whole tree, via those relationships. Eep. Doesn't scale that well any more, does it? And let me explain why it'll _never_ be on par with doing the same select directly over JDBC. You see:
    1. Reading and discarding records is cheap when you do it directly in the database server. Even the dreaded full table scan is essentially limited only by the disk bandwidth and seek time. On the other hand the app server has to do round trips over the network to get the same data. Unless your data is already in the cache (and you'll never have a 40 million record database completely in the cache), the delays in reading those records over the network can quickly add up to be crippling.


    Yes, you are right. It is relatively cheap in the database, at least for a single user. I used to use stored procedures to build n-level trees in Sybase / MSSQL with only n+1 queries .. very, very quick .. even with hundreds of thousands of rows. The problem is when there are more than a handful of users doing the same thing dynamically (e.g. different inputs). The database server quickly becomes a bottleneck, even when the tables are pinned into cache memory.
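
    (For the curious, the level-at-a-time trick can be sketched in plain JDBC too; this is a rough illustration, assuming a hypothetical node table with id and parent_id columns, that batches each level's ids into an IN clause, so an n-level tree costs roughly one query per level.)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    class TreeFetcher {
        // Collects the ids of a whole subtree, one query per level.
        List<Long> fetchSubtreeIds(Connection con, long rootId) throws SQLException {
            List<Long> all = new ArrayList<Long>();
            List<Long> level = new ArrayList<Long>();
            level.add(rootId);
            while (!level.isEmpty()) {
                all.addAll(level);
                level = childrenOf(con, level); // one query for the whole level
            }
            return all;
        }

        private List<Long> childrenOf(Connection con, List<Long> parents)
                throws SQLException {
            StringBuilder in = new StringBuilder();
            for (int i = 0; i < parents.size(); i++) in.append(i == 0 ? "?" : ",?");
            try (PreparedStatement ps = con.prepareStatement(
                     "select id from node where parent_id in (" + in + ")")) {
                for (int i = 0; i < parents.size(); i++) ps.setLong(i + 1, parents.get(i));
                try (ResultSet rs = ps.executeQuery()) {
                    List<Long> next = new ArrayList<Long>();
                    while (rs.next()) next.add(rs.getLong(1));
                    return next;
                }
            }
        }
    }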

    As far as 40 million records in the cache, why is that not possible? You're only talking about tens of gigabytes. Partition it over a couple servers in a cluster. (I'm not suggesting that one implementation choice is superior to the other, but rather that your assumption about what is easily possible today is badly outdated.)

    Cristian: 2. The very way the Entity Bean spec and cache are defined conspire against that possibility. It works on whole records.

    Yes, but as I mentioned earlier, entity EJB is about transaction, and should not be used as some magic "data access" layer for read-only operations. It is not an OODBMS.

    Cristian: 4. A database is easy to tune by a competent (or even semi-competent DBA). Stuff like partitioning, views, indexes, hints, etc, can be optimized on the fly very easily. Doing the same on an Application Server is generally not even possible, and nowhere near that easy.

    Coherence load balances the partitioned cache dynamically and transparently to the application. As servers come up or go down, it continuously adjusts for load without losing any data. Again, your assumptions are out of date.

    Cristian: 5. database technology is a mature, proven technology.

    Absolutely. And J2EE technology, including clustering and caching, is already mature and proven in the market. I can't think of a major financial institution that doesn't absolutely depend on it, for example. (I am reminded of a company I visited a few years ago whose J2EE software had processed over USD$11 trillion in transactions the year before. You'll have a hard time topping that.) If you really want mature and proven technology, you should stick with CICS/RPG/COBOL on the mainframe; to those people, Oracle and DB2 are just unproven toy programs that you run on little toy computers that have a nasty habit of crashing every couple of years. So stop with the false superiority thing and just use the best tools for the job at hand.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  66. You are allowing your personal opinions to cloud your judgement. Our customers often render fully dynamic pages faster than a JDBC driver can perform one simple query. As a result, it is impossible for a pure JDBC solution to be faster. (That said, check out Isocra and their caching JDBC driver. Cool stuff.)

    You will note that I wasn't comparing straight JDBC to cached JDBC. And as I've said before, I've used caches before, when they were needed. So you'll get no arguments from me there. Yes, if you have already done your design well (including a good data model design, where the database isn't an afterthought), run your load tests, and your bottleneck really is the database, sure, go ahead and use a cache. You don't even have to code your own nowadays. Or use a caching JDBC driver.

    What I _am_ arguing is straight JDBC vs Entity Beans. (Just to make it clear, again: when saying "straight" I'm including the cached versions. Via custom caches, or via caching JDBC drivers, or by whatever other means.) Sorry, I keep hearing about mythical projects where pure CMP ruled the day, but I've yet to actually see any. What I do see in practice is stuff where a CMP based project actually cost more manpower to implement, and still lags in performance.

    Were newbie coders to blame for that? Maybe. Maybe not. I know of at least one scrapped project here, done by consultants from a BIG corporation. Allegedly they were seasoned EJB experts, not newbies. It needed a big cluster of computers to even come close to what our non-EJB solution did on a single less powerful machine. The project was still a buggy beta, more than a year after the promised delivery date, and then it was scrapped.

    On the other hand, newbies can easily do a plain JDBC implementation. So what's the big advantage of CMP then? Needing more time, more expensive consultants, more hardware, and more risks, to deliver the same thing?

    Absolutely. And J2EE technology, including clustering and caching, is already mature and proven in the market. I can't think of a major financial institution that doesn't absolutely depend on it, for example. (I am reminded of a company I visited a few years ago whose J2EE software had processed over USD$11 trillion in transactions the year before. You'll have a hard time topping that.)

    Yes, I know a lot of people use J2EE. Guess what? In most cases it doesn't mean CMP, nor even EJB. The servlet specification, JSP and JDBC _are_ parts of J2EE. I.e., a straight JDBC app (with or without caching JDBC drivers) _is_ a J2EE app.

    I also know a lot of people who _don't_ use J2EE. In fact, I'd be hard pressed to think of any bank, since you mention financial institutions, where their main financial systems are Java based at all. At best, they might have a Java servlet facade for their web site, but behind it, where the money is really handled, chances are there'll be a big iron machine without any Java programs on it.

    If you really want mature and proven technology, you should stick with CICS/RPG/COBOL on the mainframe; to those people, Oracle and DB2 are just unproven toy programs that you run on little toy computers that have a nasty habit of crashing every couple of years.

    I don't program in COBOL, but what's wrong with it? If it runs their programs well, why shouldn't they stick to it? Basically I hope this isn't just more of the "yeah, but we've got all the newest buzzwords, therefore we're obviously superior and the right solution" attitude. As you say, "just use the best tools for job at hand". In their case, maybe the right tool is precisely to have 1 guy maintain that old but still functional COBOL program, instead of paying millions to convert it to the newest buzzword.

    It may not be a major financial institution, but I do know of one import company, doing business worth a few tens of millions of dollars per year, where moving from their old RPG programs was obviously a mistake. Sure, it's not 11 trillion like in your example, but it is, nevertheless, betting their business on their computing system. They're already at their second attempt at getting a more modern system, and it still doesn't work right. It should have been ready before the Y2K deadline, but what they have now in late 2003 is still a buggy unfinished beta that needs a lot of manual intervention to keep it running. Whereas that old unfashionable RPG system just worked.
  67. I think you are making assumptions based on the minimum of what the J2EE spec requires. Many application servers go way beyond that. WebSphere (and some other app servers) will cache CMR data so it does not have to repeatedly go to the database. It can also do "read ahead" to prefetch data for related EJBs with one query. There is a dynamic query service that can take complex EJB QL queries (with multiple CMRs) and push them down to the database so the DB can do the selection/sorting/aggregation of data and return only the relevant data, in the proper order. The EJB QL implementation goes beyond the spec to allow multiple data fields to be returned from a query, so you choose whether you want objects or data returned, and in what order.
    Most of these optimizations are configured at deployment time, imposing no additional work or complexity on the developer.
    My experience is WebSphere but I know other J2EE app servers offer similar optimizations. EJB 2.0, as implemented in the high end appservers, can be used very productively to build applications which perform and scale very well. This has nothing to do with trying to be "fashionable". This is about using the right technology for a particular application. Obviously, there are some applications for which EJB is not appropriate.

    I work for IBM but the opinions are my own.
  68. What about CMP EJBs?[ Go to top ]

    krish,
    Far be it from me to blame the CMPs or the entity beans! I am a big fan of the EJB architecture and I have used it with very good results in the past 3 years. When EJB 2.0 came out, I told everybody it was one of the best pieces of technology I had ever studied! The application I am referring to uses roughly 100 entity beans (all CMP) and a good number of session beans.
    I came to know some intimate aspects of this technology, and I have been privileged because the applications I helped build were meant for big customers with large user bases.
    I just wanted to point out that the caching technologies discussed in Dion's article (a very good one, I may say) may not fit some applications, especially those with frozen dev cycles.

    >Well, did you try another AppServer?
    Yes, we are currently rolling out a Weblogic deployment and another one using the (less known as yet) SAP WebAS container from SAP AG. It's too soon to draw any conclusions; the ugly problems always appear when certain 'singularity' conditions are met, and definitely when the customer is doing a critical and non-postponable operation :)

    Some time ago I wanted to propose on TSS a discussion in which the developers of big J2EE systems (big numbers of concurrent users, lots of resources involved, etc.) could share their thoughts, testimonies if you like: how different AppServers hold up under stress, what happens to response times after a number of hours of runtime, memory profiles, how to tune resources, stuff like this. You will hear about these problems only from developers involved in big application development (eBay-like or internet-banking apps), but unfortunately they seldom talk about it (although I've seen a good article about building ebay.com on TSS).

    florin
  69. Good info on this:

    http://weblogs.cs.cornell.edu/AllThingsDistributed/archives/000280.html
  70. The article gives a quick overview of caching, and mentions the idea of coherence. This basically means that most of the time the data read by any node is not older than 'x'. This, however, is not very useful for transactional applications. What's really needed is a cache with a well-defined isolation level. For example, you could use multiversion timestamp locking, like Oracle does, plus 2PL (or else you don't get serializability of transactions).

    On a cluster, this means that you have to start sending lock messages. This is not necessarily a problem: perhaps the message latency will be 10ms, say, when running at a decent CPU utilization. Obviously, the longer the transactions, the lower the level of concurrency.

    Then you have the problem of notifying the cache when the database gets updated behind the scenes. You can do this, but it's more annoying to code (a Java stored proc; and jConnect has a polling mechanism, I think).

    I think all in all, if you don't want to completely abandon the concept of transactions you are better off scaling the database horizontally, which you can do now - Oracle Real Application Clusters (IBM has been doing this on mainframes too.) The hardware is cheap, I think, say Dell linux boxes with Infiniband interconnect. Oracle RAC is *not* cheap (I think, I don't know) but you are saving money on the cache license, or saving even more money from not coding your own transactional cache technology.

    The thing to optimize in this approach is marshalling and unmarshalling data from and to the database. But it's probably not such a big problem. I suppose one could run the app server together with the database, perhaps inside the database JVM, so the JDBC driver could use IPC instead of TCP sockets?

    If anyone has had experience with this type of approach (horizontally scaling the database) I would love to hear what their actual results were.
  71. out of process cache[ Go to top ]

    There was an article a while back (on TSS?) about "out of process cache" servers. Basically using the network to access an in-memory cache. Since most DBs have a query cache, or even the option to pin the entire table in memory, it would be very interesting to revisit this article and compare some "out of process cache" techniques with a standard database cache. I guess the advantage of a caching approach that distributes the cache (like Coherence) is that it reduces the load on the database.

    In general, I agree that many J2EE developers don't realize that caching is inherent to the database. With network RTT being negligible on a LAN, the real value of "out of process" caching would seem to be reduction of load on the database.
    -g
  72. out of process cache[ Go to top ]

    geoff: There was an article a while back (on TSS?) about "out of process cache" servers. Basically using the network to access an in-memory cache.

    Coherence supports all three (in process, out of process, and hybrid), but TSS is only using the in-process (AFAIK).

    geoff: Since most DBs have a query cache, or even the option to pin the entire table in memory, it would be very interesting to revisit this article and compare some "out of process cache" techniques with a standard database cache. I guess the advantage of a caching approach that distributes the cache (like Coherence) is that it reduces the load on the database.

    Absolutely. If you aren't tuning the database and the queries and the connection pools and the prepared statement caches and pinning tables in memory etc. then you are missing some obvious (and basically free) improvements in scalable performance. There is no excuse for ignoring the capabilities built into the database. On the other hand, there is similarly no excuse for ignoring the potential for caching in the business or object tiers. In fact, your best bang for the buck (in terms of caching) comes from the tiers in front of J2EE, such as edge caches, static content caches, auto-expiring dynamic content caches, etc. As a rule of thumb, the further back you go (in the tiers, e.g. toward the database) the more expensive each operation gets, so the earlier you can cache, the better.

    geoff: In general, I agree that many J2EE developers don't realize that caching is inherent to the database. With network RTT being negligible on a LAN, the real value of "out of process" caching would seem to be reduction of load on the database.

    No large project should be without a dedicated database team and a dedicated J2EE team and the requisite architects and managers that will ensure that each is capable of optimizing their own without trampling on the requirements of the other. Most of the large J2EE projects that I've witnessed (and I'm referring to well over one hundred) have at least one "guru" DBA who is involved with the high level J2EE architectural decisions so that there are no late surprises.

    Listening to the conversation in this thread, it's obvious that some of the projects that people have seen have been 100% J2EE centric without enough involvement from the DBAs that would eventually be suffering from the architectural oversights. I'm not suggesting J2EE caching as a good way to be free of responsibility; far from it! I know from experience that caching in the J2EE tier is part of a responsible and scalably performant architecture.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  73. Java or SQL level caching ?[ Go to top ]

    The solution described in the article requires application modifications. What is the cost of debugging and maintaining such code?
    Solutions that perform caching at the SQL level, like for example C-JDBC (see http://c-jdbc.objectweb.org), do not require any application modification and can provide advanced caching features. How would that compare to the Tangosol solution?
  74. I have a question about the java.util.Map interface. Granted that java.util.HashMap provides a convenient way to store and retrieve objects by keys within the same VM, the Map interface was never designed to support operations that may potentially fail, such as network and database access. However, read-through and write-through behavior dictate that the Map.get(...) operation may potentially load an object from the underlying database and the Map.put(...) operation may actually store the object in the database. Moreover, in case of “read” failure, the Map may choose to return null (somewhat clumsy), but in case of “write” failure (e.g. an optimistic locking exception from the database) the whole error handling becomes somewhat obscure.

    There is a partial solution to this (I have seen it in SpiritCache, but not in Tangosol) which involves throwing some checked exception from the CacheLoader.load(...) method and then having a generic error handler for that exception. However, no error handler will be able to propagate this exception back to the user in an easy and natural way – which is throwing a checked exception.

    Of course, another solution is to throw an unchecked exception, but a big portion of the community would not appreciate the unchecked exception approach.

    Does the use of the java.util.Map interface dictate throwing unchecked exceptions which wrap checked exceptions thrown by database operations during loading and storing? If yes – then maybe reusing the Map interface, which was designed for operations that cannot fail, was not the best choice here... If developers do need a Map-like API for cache access, maybe it would be better for JSR 107 to come up with a Cache interface which is a replica of the java.util.Map interface but with checked exceptions added to certain methods (such as “get” and “put”).
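
    To make the proposal concrete, here is a sketch of what such an interface might look like (hypothetical, for illustration only; this is not the actual JSR 107 API):

    // Hypothetical exception type for failed cache operations.
    public class CacheException extends Exception {
        public CacheException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    public interface Cache {
        // May read-through to the backing store; a load failure surfaces
        // as a checked exception instead of an ambiguous null.
        Object get(Object key) throws CacheException;

        // May write-through; an optimistic locking failure in the
        // database can propagate to the caller naturally.
        Object put(Object key, Object value) throws CacheException;

        Object remove(Object key) throws CacheException;
    }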

    In addition I would like to say that if both Tangosol and SpiritSoft claim JCache compliance, then one would expect that these products should at least agree on the public API of the CacheLoader interface, which needs to be implemented by end users in almost any cache-related project. SpiritCache’s CacheLoader.load() method has a “throws Exception” clause and Tangosol’s doesn’t (at least according to the online JavaDoc). Which one is JCache compliant then?

    Regards,
    Dmitriy Setrakyan
    xNova™ - Reusable System Services for Java and .NET
  75. Toplink in a cluster environment[ Go to top ]

    I'm working on a project which has a cluster of two Weblogic 7.0 servers
    against an Oracle9i database, using Toplink 9.0 as the persistence layer.
    Because we activated the internal cache in Toplink, and we are in a cluster
    environment, we ended up with lots of problems.

    According to the Toplink docs (though they're quite vague), we tried to use the
    "CacheSynchronizationManager" with JMSClusteringService (basically, each
    change in the cache is replicated into the other caches).

    Well, we couldn't make it work, so we're running out of options. As
    an extreme solution we might have to disable the cache completely,
    and therefore take the performance hit.

    If anybody has experience with Toplink in a cluster environment, I'll
    appreciate any input.
  76. According to the Toplink docs (though they're quite vague), we tried to use the

    > "CacheSynchronizationManager" with JMSClusteringService (basically, each
    > change in the cache is replicated into the other caches).
    >
    > Well, we couldn't make it work, so we're running out of options. As
    > an extreme solution we might have to disable the cache completely,
    > and therefore take the performance hit.
    >
    > If anybody has experience with Toplink in a cluster environment, I'll
    > appreciate any input.

    Have you considered a home-grown option, then, meaning JavaGroups? It should work at least as well as TopLink's cache. I doubt TopLink's cache has any nice transactional properties. I tried to use JavaGroups three years ago and it just didn't seem to cut it at the time, but by now it should be pretty reliable. I ended up using Ensemble, which is the granddaddy of JavaGroups. I can personally vouch for the reliability of Ensemble, since it successfully replicated a cluster at a major media corporation. It ran straight for a couple of months with no missed messages. I think it will do several thousand messages per second, with latency of a few milliseconds: http://www.cs.cornell.edu/Info/Projects/Ensemble/ftp.html

    If not, if you have the money just buy Cameron's Coherence cache.
  77. RE: Toplink in a cluster environment[ Go to top ]

    Marius,

    I am going to assume that '... we ended up with lots of problems' refers to stale objects in the cache relative to the other server and the database. TopLink's cache-sync features are a great way to minimize stale data, but you should also make sure that you address concurrent access to your persistent objects using optimistic or pessimistic locking. This will identify or prevent concurrent modifications based on stale state.

    TopLink's shared cache (which is transactional) can be configured to use cache synchronization over JMS through the API or, more easily, through the sessions.xml configuration file. The use of JMS cache-sync is very popular and successful among our user base. I'll need some additional information on what you are struggling with to be of assistance.

    A recent project I met with was using JMS cache-sync configured from the sessions.xml file. The portion of the config file of interest:

    <cache-synchronization-manager>
     <clustering-service>oracle.toplink.remote.jms.JMSClusteringService</clustering-service>
     <should-remove-connection-on-error>false</should-remove-connection-on-error>
     <jms-topic-connection-factory-name>jms/DCTTopLinkTopicConnectionFactory</jms-topic-connection-factory-name>
     <jms-topic-name>jms/DCTTopLinkCacheSynchTopic</jms-topic-name>
    </cache-synchronization-manager>

    From the sounds of things you are very eager to get this working quickly, so the best route may be our support organization (metalink.oracle.com). If you do not have a support contract, I would recommend the user forum; please provide specific details of what is failing for you.

    Doug Clarke
    Product Manager
    OracleAS TopLink

    Some additional links that may be of use:

    JMS cache-sync docs: http://download-west.oracle.com/docs/cd/A97688_12/toplink.903/b10064/enterpri.htm#1022254

    TopLink User Forum: http://forums.oracle.com/forums/forum.jsp?forum=48
  78. Don't forget Reverse Proxies[ Go to top ]

    Although Java-based caching gives the most accurate and manageable environment, I would like to point to a cheap and easily implementable alternative that can help most websites: reverse proxies. Reverse proxies are ideal for read-heavy applications that aren't real-time. A reverse proxy sits in front of a webserver and accepts all incoming traffic. On each refresh, it reads the dynamic page from the webserver and caches it locally as HTML. Every subsequent request is served from its HTML cache. Refreshing happens periodically. Caching can be configured based on URL parameters, so you can exclude the real-time parts of your website. Squid is the best known, but you can also use Apache (httpd).

    Now, what if only some parts of the page are cacheable? For instance, when you use personalization, or those nice comments at the bottom of each TSS page? Then you can use ESI, the Edge Side Includes protocol (not yet implemented in production Squid). This protocol (for JSP tags: JSR 128) allows you to specify which parts of the page can be cached by the reverse proxy server.

    Marc Schipperheyn
    TFE
  79. I wonder what the rationale is for using entity EJBs at TheServerSide when the most important entity EJB container services are not used: transparent persistence and caching.
    If BMPs are used then persistence is not transparent (i.e. DB access needs to be developed, not to mention that it is inefficient). If Coherence is used, then the container's internal caching is not used.
    So what remains of the entity EJB container services:
    transactions, method-level security, resource pooling, entity remoting (maybe I forgot something).
    Any decent O/R mapper can handle transactions and resource pooling.
    Entity remoting? A lot has been said about how that can be bad.
    Entity method-level security? Well, I'll say that this is the only reason I would use entity EJBs, but method-level security on entities is useful in maybe one application in a million.

    So, I understand why the entity EJB container's services (transparent persistence and caching) are not used. They are poorly implemented because the EJB spec is poor (when defining entity EJBs).

    So, let us all forget about entity EJBs like a bad dream and look forward to some really good object persistence framework like JDO and/or Hibernate, with pluggable data sources and caching services, JTA support... With AOP or CGLib it is even easy to add method-level security if it is really needed.

    Mileta
  80. I think the EJB spec lacks a pluggable distributed cache and mapping API for CMP, one that would let Toplink, Cocobase, Frontier and many others plug their persistence engines into any AS if you are not satisfied with the implementation of the engine that comes with your AS.

    As for DB cache vs. AS EJB cache: it's best to have both; if only one, it's hard to say. A DB cache can be expensive. An AS cache can be hard to develop and use. How would app servers (without a cache) running on DB boxes with a cache perform vs. one big DB box and many app servers with caches?
  81. This approach looks fantastic for read-intensive processes.
  82. Artificial Problems, IMHO[ Go to top ]

    No doubt caching is very important in the real world, but is there a _real_ necessity for such complex solutions? I mean synchronizing cache contents, write-backs, timeouts, etc.

    For a while now I have successfully used a simpler approach:
    1. immutable DTOs = no need to sync contents
    2. invalidation policy = simple DTO instance drop (triggered by JMS)
    in case of entity EJB update + lazy load (see the sketch below)
    3. independent caches allocated on every cluster node = extremely scalable arch
    4. O/R mapping on top of CMP EJBs = fast development

    And another big plus: such cache content is always in sync with the DB, so NO TIMEOUTS AND REFRESHES are needed, and no bother in pres. tier development!
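
    (A minimal sketch of point 2, assuming the entity bean update publishes the entity's primary key as a JMS TextMessage on a shared topic; the class name and the wiring are made up:)

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // Per-node cache of immutable DTOs. An entity bean update publishes
    // the primary key on a topic; every node drops its copy and the DTO
    // is lazily reloaded on the next miss.
    public class InvalidatingDtoCache implements MessageListener {
        private final Map cache = Collections.synchronizedMap(new HashMap());

        public Object get(Object key) { return cache.get(key); } // null = reload lazily
        public void put(Object key, Object immutableDto) { cache.put(key, immutableDto); }

        public void onMessage(Message message) {
            try {
                cache.remove(((TextMessage) message).getText()); // drop the stale DTO
            } catch (JMSException e) {
                cache.clear(); // can't tell which entry went stale: flush to stay safe
            }
        }
    }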

    My own small library satisfies all the needs of everyday business app development; I don't yet see any necessity for more complex facilities, whether very proprietary or based on the upcoming JCache standard.

    best,
    --
    Mike Skorik
    Chief Architect of
    http://www.100kSolutions.com
  83. Artificial Problems, IMHO[ Go to top ]

    Mike,

    1. immutable DTOs = no need to sync contents

    Yes, for read-only data, even a Hashtable is sufficient for caching.

    2. invalidation policy = simple DTO instance drop (triggered by JMS) in case of entity EJB update + lazy load

    This has two problems: (1) the data that you are working with the most is going to always be thrown out of cache precisely because you are working with it, and (2) there is a staleness window because you are using an invalidation approach against transactional data. So, for example, your app may appear to work correctly, but every once in a while in production it will make transactional decisions using stale data.

    3. independent caches allocated on every cluster node = extremely scalable arch

    This, if it can be done, is optimal (up to a certain cache size). The more independent (i.e. stateless) the machines are, the more easily they can scale.

    And another big plus: such cache content is always in sync with the DB, so NO TIMEOUTS AND REFRESHES are needed, and no bother in pres. tier development!

    As I mentioned, the cache content is not always in sync with the DB because the DB is transactional, the cache is not transactional, and furthermore your invalidation approach is asynchronous. For most applications, this is probably sufficient, although it allows data corruption to occur if the data transfer does not assume the potential for data changes having occurred on the back end (i.e. employing full optimistic concurrency checks.)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  84. First of all: Let's say I'm rendering data to the user, which in the end will happen at each step in a web application. Now what's the difference between "I took stale data in the 1ms time frame before the cache is invalidated" and "I took and rendered the correct data, but 1ms later it was changed"?

    E.g., in an online book store, a user may see a book as still being in stock, while another user is just about to check it out. E.g., on a forum, the user may see no posts yet in a thread, although someone else just hit the "submit" button. E.g., on a travel agency site, the user may see a hotel room as still being available, even though someone else is just completing a reservation. Etc.

    It doesn't matter if it's because of the cache, or simply because of the nature of HTTP. The user is still seeing incorrect data either way. And the system must still check it against the real data when actually completing a transaction.

    Second: The "Real World" tends to work by somewhat more relaxed rules than the textbook CS examples. E.g., an online store will actually _want_ to sell you goods which are not in stock. They'll just notify you that you'll need to wait a bit while they order some more, and offer you a chance to cancel the order or order somewhere else if you're in a hurry. E.g., on a travel agency's site they might send you an alternate offer if two people booked the same room. E.g., even a bank -- which is, incorrectly, the canonical example used for ACID transactions -- doesn't really work that way. In reality, when you transfer money from account A to account B, it doesn't happen within a single DB transaction at all. The system is just prepared to undo the transfer later if something goes wrong.

    I.e., in the real world, very few things actually absolutely need to be in a transaction. Stuff like recording the order and charging the customer, well, that belongs in a transaction. Stuff like retrieving product data, stock, or whatever, might as well be cached or non-transacted, or even retrieved from a different database.

    And, again, see the first point: once the user's browser displayed that data, it's "cached" on the user's screen, and may well be stale there.
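
    (To illustrate that split, a sketch in plain JDBC with made-up table and column names: only the statements that move money share the transaction; everything the page displayed beforehand can be cached or read outside it.)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class OrderService {
        // Only the statements that move money share a transaction; the
        // product data shown on the page can come from a cache or from a
        // plain auto-committed read.
        void placeOrder(Connection con, long productId, long customerId,
                        int qty, long priceInCents) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement order = con.prepareStatement(
                     "insert into orders (product_id, customer_id, qty) values (?, ?, ?)");
                 PreparedStatement charge = con.prepareStatement(
                     "update account set balance = balance - ? where customer_id = ?")) {
                order.setLong(1, productId);
                order.setLong(2, customerId);
                order.setInt(3, qty);
                order.executeUpdate();
                charge.setLong(1, priceInCents * qty);
                charge.setLong(2, customerId);
                charge.executeUpdate();
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            }
        }
    }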

    Third: there's a _lot_ of data which _is_, for all practical purposes, read-only. E.g., in an online store, the item descriptions or the links to the manufacturers' sites change very, very rarely. Ditto for the prices. They might change every week, rarely every day, but unless you're operating a stock trading site, they'll not change every second.

    And, again, see the first point: once the user's browser displayed that data, it's "cached" on the user's screen, and may well be stale there.

    So, IMHO:

    1. I'd still advise tuning the database and the database model first. Sometimes there may be no need for caching after all.

    Or, more correctly: it might be cached just as well by the database. Oracle or DB2 _can_ cache those description and price queries for you. Quite efficiently, too. A lot of the misconception that "SQL queries are too slow" is based on flawed benchmarks or poorly configured databases. E.g., a single-thread benchmark will show the network latency as a huge factor and paint a false picture that "hey looky, our Java cache is 1000 times faster", whereas a proper load test might paint a very different image.

    But:

    2. if your load tests _do_ indicate a need for caching, in practice there's (IMHO) a _lot_ of data which does fit Mike's model perfectly. Of course, it takes some design work to identify which data, and when. But then that's just normal.

    3. of course, if you're at that point, you might as well throw the whole EJB stupidity out the window too, and gain control over what really needs to be in a transaction, and what doesn't. You'd be surprised at the kind of speed-up that can be obtained in some cases, when you don't have the container's straitjacket wrappers around an Oracle Connection or PreparedStatement.
  85. you catch it ;-)[ Go to top ]

    First of all: Let's say I'm rendering data to the user, which in the end will

    > happen at each step in a web application. Now what's the difference
    > between "I took stale data in the 1ms time frame before the cache is
    > invalidated" and "I took and rendered the correct data, but 1ms later it was
    > changed"?


    Exactly!!
    user thinking time >> JMS triggering time

    > Or, more correctly: it might be cached just as well by the database. Oracle
    > or DB2 _can_ cache those description and price queries for you. Quite
    > efficiently too.

    Agree again!
    What's more, for the DB to operate efficiently, at least the indexes should be kept in memory. As for my scheme, the usual routine is to obtain a vector of IDs from the DB and then (at the pres tier) render the detailed list to the end user, fetching the DTOs from the cache.
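
    (A sketch of that routine, with hypothetical table and class names, reusing the per-node DTO cache sketched earlier in the thread:)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    class ProductListRenderer {
        // The DB answers the cheap question (which ids match?); the detail
        // DTOs come out of the local cache, lazily reloaded on a miss.
        List renderList(Connection con, InvalidatingDtoCache cache, String category)
                throws SQLException {
            List dtos = new ArrayList();
            try (PreparedStatement ps = con.prepareStatement(
                     "select id from product where category = ?")) {
                ps.setString(1, category);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        String id = rs.getString(1);
                        Object dto = cache.get(id);
                        if (dto == null) {
                            dto = loadDto(con, id); // hypothetical single-row loader
                            cache.put(id, dto);
                        }
                        dtos.add(dto);
                    }
                }
            }
            return dtos;
        }

        private Object loadDto(Connection con, String id) throws SQLException {
            // select the full row and build the immutable DTO (omitted in this sketch)
            return new Object();
        }
    }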
  86. Cristian: First of all: Let's say I'm rendering data to the user, which in the end will happen at each step in a web application. Now what's the difference between "I took stale data in the 1ms time frame before the cache is invalidated" and "I took and rendered the correct data, but 1ms later it was changed"?

    Nothing. Your points are all very valid, and while I have commented on the same before, your explanation of the same was much more cogent. If you don't have transactional requirements then staleness becomes acceptable for asynchronous applications, such as web browsing. Further, if your database can handle the load easily and the application is fast, then why complicate the architecture? These are all good points, they just aren't indicative of the customers and projects that we work with. In the case of this particular article (about TheServerSide.com and its use of caching), without caching in the application tier their database server was dying (literally and repeatedly), and even before it would die, pages were taking tens of seconds to serve. Caching, for their application and environment, turned out to be a good thing.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  87. Very good points. Always correctly analyze the database problem before jumping to conclusions. Often the solution to the problem gets resolved through simple DB tuning or with small changes to your application.

    For instance, with previous versions of popular RDBMSs, an application allowed to generate ad-hoc dynamic queries could miss a natural join clause, resulting in a cross-product of more than a million rows. The rogue query makes everyone else pay a significant penalty, as a good portion of the data that would normally be in memory (e.g. the Oracle SGA) now moves to the 'TEMP' tablespace. I wonder if current versions of popular DBs can easily be protected against such stupidity (would options like pinning in the buffer cache always be honored?).

    There are several tools in the marketplace that offer a quick means of understanding your query patterns from within application servers without any code-level intrusion. For instance, we offer a JDBC interceptor cache (www.gemstone.com/gemfire) that allows the application to monitor and learn the DB access patterns across a cluster, presenting a unified view of all DB activity from the perspective of a distributed application. A good tool will apply simple rules to automatically tell if the application is cache-worthy: monitor repetitive queries, time consumed in the database, DML operations that impact cached results, and so on.

    Here are a few other points I would take into consideration for modern day architectures:
    1) We look at caching not just as a middleware plugin for caching data from a single database, but also as a distributed data fabric that abstracts data from many data sources and even applications. We believe most enterprise-class applications in the future will rely on multiple data sources, some of them changing in real time. For instance, we have a client working on a "bond" portal where cached products and prices are used for rendering screens, but the exact data used for rendering depends on both the user request and real-time click-stream information delivered from another channel.
    2) More and more in the future, you will see resident data in databases being integrated in real time with streaming data. Financial trading is an obvious example, where financial tick data delivered at high speed from an exchange is integrated with reference data delivered from multiple vendors and sources. Databases are quite inappropriate for such stream-data management. You need a high-speed, distributed, main-memory data management layer that can work natively with streams, databases and messaging systems.
    3) How do you handle DB demand spikes? In my experience with typical database designs, the resource allocation planning (tablespace size, growth rate, disk organization, indexing, memory requirements) takes average and peak loads into consideration. But with a lot of apps (like portals) these are not known until the app goes into production. Demand on a popular portal increases over time, and can surge through successful marketing campaigns, M&A, etc. Also, the number of applications hitting the single database grows over time, further complicating the initial capacity planning or making it somewhat useless. Now throw automated interfaces such as Web Services, which can create sporadic load on the DB engine, on top of this. Unless you don't mind the costs associated with replicated databases spiraling out of control, middle-tier data management is a no-brainer.

    Cheers!
    Jags Ramnarayan
    GemStone Systems (http://www.gemstone.com)
  88. Artificial Problems, IMHO[ Go to top ]

    1. immutable DTOs = no need to sync contents

    >
    > Yes, for read-only data, even a Hashtable is sufficient for caching.
    >
    What is the reason to mutate a DTO? The entity bean is there for that purpose!

    > 2. invalidation policy = simple DTO instance drop (triggered by JMS) in
    > case of entity EJB update + lazy load

    >
    > This has two problems: (1) the data that you are working with the most is
    > going to always be thrown out of cache precisely because you are working with
    > it, and (2) there is a staleness window because you are using an invalidation
    > approach against transactional data. So, for example, your app may appear to
    > work correctly, but every once in a while in production it will make
    > transactional decisions using stale data.
    >
    (1) Of course, but this is just 1% of all the data I operate on...
    (2) Yes, you're right, but when I really need valid data, it can easily be fetched from the entity EJB directly. I do not use the cache in the business tier.

    > And another big plus: such cache content is always in sync with the DB,
    > so NO TIMEOUTS AND REFRESHES are needed, and no bother in pres. tier development!
    >

    >
    > As I mentioned, the cache content is not always in sync with the DB because
    > the DB is transactional, the cache is not transactional, and furthermore your
    > invalidation approach is asynchronous. For most applications, this is
    > probably sufficient, although it allows data corruption to occur if the data
    > transfer does not assume the potential for data changes having occurred on
    > the back end (i.e. employing full optimistic concurrency checks.)
    >
    IMHO it's enough in _any_ application for rendering data at the presentation tier
    (see Cristian's message and my reply...)

    best,
    --
    Mike Skorik
    Chief Architect of
    http://www.100kSolutions.com
  89. GigaSpaces Distributed Cache[ Go to top ]

    Hi,

    Thought this might interest you all:
    http://www.gigaspaces.com/download/GSDistributedCachingWP.pdf

    GigaSpaces Distributed Cache has both JavaSpaces and Map interfaces!

    This is part of the GigaSpaces 3.2 release.

    Best Regards,

            Shay

    ----------------------------------------------------
    Shay Hassidim
    Product Manager, GigaSpaces Technologies
    Email: shay at gigaspaces dot com
    Website: www.gigaspaces.com
    GigaSpaces - Be Local Sync Global