How many projects have you worked on where the bottleneck in testing was the database, for both performance and scalability? In my experience, getting to the data, creating new data, and updating data takes a fair amount of effort.
If you are developing a small application, then you may not be worrying about scalability as much. However even small applications like to run fast! When you get to larger enterprise applications, then the concern grows significantly. As you add more and more concurrent users your poor database gets more and more bogged down.
This article discusses how caching data in front of the database can give our DB a break, and allow for faster running, and more available applications.
In particular, We will look at:
- What is clustering?
- Using a distributed cache
- Caching strategies
- Read-through / Write-behind caching
- Technologies that integrate nicely into this architecture
What is clustering?
Clustering is one of those words used in computer science to mean many things. It is over used like other words in our industry such as "Sessions" :-). I am going to refer to clustering in the sense that "you can add more instances to your cluster and the application scales in some fashion."
There are many issues with clustering. Have you ever had a manager say something like:
"If we buy another box, our app will run twice as fast right?"
This is of course rarely the case. We can obtain an increase in performance, but only if we design our application to cluster for performance.
How about this gem:
"If we buy another box, our app will be twice as reliable won't it?"
Again, this isn't always the case. Higher availability will only occur if you design your application to cluster for availability.
Scaling Application Performance
In a distributed system, we can add more boxes running our application to the backend. Our application will only scale for performance, if by adding these boxes you enable the entire system to do more work concurrently, accommodate more users, etc.
It is fairly easy to scale for performance if your application lives in its own world and doesn't share anything. A key to understanding the scalability of an application for performance is to characterize the application's requirements to share objects. If the shared object is read-only, then it can make copies, which allows us to scale. We get into problems when shared objects need to get updated. Depending on the policy for updates, we can shoot ourselves in the foot here. If updates are fairly infrequent, then the work of making the updates allows us to scale nicely. However, if we are updating very frequently, then the work of pushing out updates can actually SLOW down the application. At this point it is better to have one point of contention which is managed, than to have copies that we try to keep in sync. This is the game that we play. How often do we need to keep our copies in sync? How strict do we get (do we need 100% coherence? Or can we get away with a lag). We need automated testing to make sure we are doing the Right Thing™.
Scaling Application Availability
We have similar considerations with applications that scale for availability. Application availability is limited to the availability of the system. We can get a "better box" (add more redundancies etc), but can only go so far. When we deal with distributed architectures, adding more boxes does not necessarily increase application availability. It all depends on what services are shared. The worst case is that all of the machines need to be available for the application to run. For example, you have one web server, one application server, and one database. If one of the machines dies, you are in trouble. In this scenario we are LESS available than if the entire application was on one box.
So, just like with performance, life comes down to sharing objects. If we build nice stateless architectures where we can easily add machines to the system, then we can get extra availability.
Let's see how a distributed cache can help.
Cache is King: Distributed Caching
TheServerSide started off its life running on a nice little Linux box. We had one application server, and the database was on the same machine.
We were tasked with having TheServerSide run on a cluster of different application servers. Ok, so we get machines for each application server, one for a database machine, and it should run a lot faster now right? We have X machines instead of one! Wrong. When we first took the app and ran it in this configuration it was deathly slow. The reason was quite simple. With the single application server we were able to set a flag to false (db-is-shared in WebLogic) that told WebLogic that he was in charge of things. The WebLogic instance could assume that all access to the database would be through the application, and hence it could be cached between transactions. After the site had been running for awhile, the entity bean cache would basically be containing the data required for the system. The database would only be contacted when updates/creates/deletes happened. This was nice and zippy. Unfortunately it is fairly rare to be in this privileged situation with applications in the real world.
The db-is-shared magical flag has to be turned on if you fall under any of the conditions below:
- You want to run more than one instance of the application
- E.g. using WebLogic clustering
- Other programs access the database behind the back of the application
- Outside processes
- DBAs sneaking around changing things
NOTE: The db-is-shared flag is now called cache-between-transactions in recent WebLogic versions
So, it is a rare day where you can say that your application is the sole access point to the database, but take advantage of that situation if you are lucky enough to be in it! Use your appserver caching facilities!
Without this flag, our various application servers were always guessing about the data that lay behind them. "Hmm, I don't know if this entity has been changed, so I better go to the database and reload it just in case." All of the ejbLoad/ejbStore traffic that was taking place for no reason brought the application to a crawl. We now had to take the application that was NOT built for scalability of performance, and change it.
A lot of applications are read-heavy, especially web-based applications. There are many more concurrent users reading our news, forums, patterns, articles, etc, compared to those posting information. This fact allows us to design our application to focus on fast reads.
We looked at different solutions, and decided on using a distributed cache that would sit in front of the database. This cache would contain a subset of information that the database contains, and would enable the application to retrieve the data VERY fast.
JCache - JSR 107
One of the oldest JSRs around is the JCache API (JSR 107). Oracle originally started the effort (and is spec lead) and donated ideas from their implementations (like BC4J). However, a long time has passed, and nothing has come out of the expert group to date. A few rumors suggest that JCache is NOT dead, and will most likely be based on wrapping around the java.util.Map interface that all Java programmers know and love.
The implementation of a distributed cache that we chose was Tangosol Coherence. The Coherence product has been used in many large projects, and prides itself on the fact that it is FAST, and NEVER loses data. Coherence appears in the programming world as a java.util.Map that you can work with like any other Map that you are used to working with. It just so happens that behind the scenes the caching architecture is making sure that the data is in the right place. So if you put something into cache X, you can retrieve it on another machine with the same key, as though it was in memory there.
We strapped Coherence into our Entity Bean layer. Now, when we do an ejbLoad() we first look into the cache to see if it is already loaded. If it is, we pass the reference back; if not, then we load it into the cache from the database. What do we put into the cache? The keys are the Primary Key classes for the given Entity beans, and the values are the Data Transfer Objects (DTOs, or Value Objects) themselves. If an entity is deleted or updated, then it is nuked out of the cache, to be reloaded in another transaction (in ejbRemove()/ejbStore()).
Here are some snippets from a BMP Entity bean that represents a Thread in TheServerSide. A couple of EJB methods are shown, with the code for interfacing with the cache.
This work took a few days to do, and means that once again, when TheServerSide application is up and running, it basically contains most of the data within this distributed cache. Reading a DTO out of the cache is immediate, and this is how TSS was able to actually run faster in the cluster, as it scales for performance.
Does it scale for availability? Yes. We can startup/bring down machines within the cluster and no one is the wiser. As long as one application server is up and running, the site can work fine.
Distributed Caching Strategies
When you configure your caching architecture, there are many things to think about. It isn't necessarily as easy as using the default caching that your product will give you. Here are some options:
Imagine having two JVMs running an application with a joint cache. If the application in JVM 1 creates a new object and places it in the cache, how will JVM 2 get access to that information? A replicated cache keeps a copy of the entire cache contents on each cache node. If memory isn't an issue, and there are only a few nodes, this can be the best option. Read requests are always very fast, as they are going to be similar to reading from any object in memory (as the data is already there in the cache). If you have a huge set of nodes then you may want to look to another strategy, as each put into the cache will cause many events kicking in, and data going to the entire set of nodes. Thus, the more nodes you add to the cluster, the less each will potentially perform.
Figure 1. Shows that when an object is placed in a replicated cache, each node gets a copy
To contrast the replicated cache, a partitioned cache will not have to contain the entire contents of a given cache. Typically there will be a way to partition the data in a cluster, so each node will share the burden. What do you get with this kind of cache?
- Load Balancing: Since the data processing is shared among the cluster nodes, you have some kind of automated load balancing of this work.
- Location Transparency: Although internally data may be living on various nodes in the cluster, the program has no idea about this abstraction. Developers just get objects from the cache, and the implementation works out where to get the data from.
- Failover: Some partitioned caches will allow for some kind of failover. This often means that a given node has a "backup" member which contains a copy of the data. If the primary node goes away, the backup steps up to the plate (and itself chooses a new backup... and the world goes on).
Since the burden is shared among the nodes, this type of cache can scale more linearly than the replicated cache will be able to.
Figure 2. Shows that when an object is placed in a distributed cache, each node shares the copies
A lot of work can go into making sure that a given cache is coherent. Some use cases don't call for this guarantee. Often "most recently known" values are just as good as "this is definitely the right current value". For example, when we look at stock quotes online, we normally get data that is lagged behind the exact current value. Brokers themselves, however, need more up-to-date stats. Optimistic caches are normally more loose with their data, with increased performance.
Reading through, writing behind
The caches that we have seen have the onus of the developer for putting data in, and extracting it out of the cache in their application code. For example, our entity beans would read from the cache in ejbLoad() to see if the DTO is in memory. It also loads up the cache from data in the database. Wouldn't it be nice if the cache could maybe take care of this for you? Often they can.
You can often configure a cache to load data from a store (db, web service, LDAP, legacy) on a cache miss. This is called reading through.
When an object is placed into the cache, it can be updated/created in the given backend store. This is called writing-through or writing-behind. Often you can tune the interval frequency of writing the data back to the store.
We used a read-through/write-behind cache for a recent feature coming up on TheServerSide. We didn't used to track the number of views that a given thread received on our community. Reading a thread is a read-only activity, and we didn't want to impose any delay in that use case for the user. Our solution was to create a cache that contained DTOs which tracked the number of views on a given thread. When a thread is viewed by a user, the cache object is incremented, and flow moves on. At a given frequency, the cache itself will call out to a cache hook which will update the count in the database. The larger the frequency the less frequent the DB access. This allows you to tweak the DB contention from one nice location.
An important note about this solution is that the developer that uses this cache is simply running gets and puts on a Map interface. The cache developer has the burden of writing "what do do" on reading and writing to the database.
Cache Miss / DB Load
On a cache miss something like the following would be called:
The key that the cache was passed on the cache-miss gets passed down the chain, and this code goes to a backend database to retrieve the value. The value returned from this method will then be put into the cache, and back to the application code that called cache.get( pk ).
Cache Write / DB Insert or Update
When a developer puts an object into the cache via cache.put( key, value), the following may be called at some point (depending on the write settings):
This kind of setup gets you rolling in a "Cache is King" mentality. The cache becomes a lot smarter than just a data container, and you end up going to the cache itself, and don't even think about hitting the database directly. This allows you to abstract the final persistence strategy completely.
Crude to elegant: Isn't it nice when technologies mix?
We have looked at some examples of using distributed caches, different configurations for different situations, and how we used distributed caching on TheServerSide. The cache itself is really nice, but what is even more compelling is how it can be integrated with other technologies such as:
Java Data Objects (JDO) has a cache manager built right into the spec. When you combine the power of JDO (transparent persistence), and distributed caching, you get a very powerful data tier. Now our applications can have transparent persistence which is high performing, even on a large distributed scale. I would definitely be compelled to choose JDO with cache for large enterprise projects. Some JDO vendors come with some notion of a distributed cache, and others integrate with cache products, such as SolarMetric Kodo + Tangosol Coherence.
Messaging is such a useful concept in computing, especially in the enterprise. In many situations we need to make sure that messaging is guaranteed, and often we turn on database storage. A nice combination can be having the message broker storing messages into the distributed cache itself (which can itself be backed by a database).
Although not its primary use, the distributed cache itself can be used as a micro-messaging system. When objects (messages) are placed on the queue (in the cache) the consumer (the write-behind listener) can take that message and do whatever it needs to with it. While in most circumstances a "real" JMS provider would be ideal, this system can definitely work!
You want clustered JNDI? Your application server doesn't give it to you out of the box? Share the JNDI tree in a cluster!
We almost think of the cluster as a low level service that is available as an abstraction to the database itself. When you get using the cluster, and start to enjoy the speed and reliability, it is amazing how often you think about using it... in many use cases.
On many projects the poor old database doesn't get given a break. Managing the data tier isn't an easy task, even though there are many potential solutions out there. If possible, I try to stay away from having to cluster the database machines themselves, although in some situations you find that you have too. In these cases I try to divide out the tasks to different databases (e.g. a backend DB for nightly processing, a read only cluster, etc).
Many technologies give you some kind of caching out of the box, such as EJB CMP 2.x, and JDO. When these technologies are married to distributed clusters you can tweak the data tier to give you the optimal scalability for both performance, and availability, as needed in your use cases.
Hopefully the JCache API specification will surprise us and get into the fast track, and then we will be able to write to a vendor neutral specification for our distributed layer. This doesn't mean that different caching vendors will be "plug 'n play", but at least the code will stay pretty much the same, and most of the porting time would be in our lovely XML files. :-)
About the Author
Dion Almaer is the Chief Architect of TheServerSide.com J2EE Community and formerly a mentor and trainer at J2EE experts The Middleware Company. Dion is a columnist on Enterprise Java topics at onjava.com and theserverside.com.