
Jakarta Slide's Transactional Storage System

By Oliver Zeigermann

01 Mar 2004 | TheServerSide.com

Introduction

This article delves deep into the guts of Slide's transactional storage system. Guided by samples from real code, you can see how simple transactional caching might work and what issues need to be addressed. Based on this, you will learn how locks are used in Slide's transactional file system to get full ACID transactions.

More generally, this article should be interesting to anyone working with transactions. It also serves as a short tutorial on how to create transactional systems yourself and as an introduction to how this is done in Slide.


What is Slide?

Slide is a WebDAV-compatible content repository hosted by the Jakarta project of the Apache Software Foundation. You can store and retrieve all kinds of content resources from Slide, attach metadata to them, maintain multiple versions, control access, and search over full text as well as over metadata. WebDAV extends the HTTP protocol: to the well-known GET, POST and PUT methods of HTTP, it adds methods like PROPFIND, SEARCH, LOCK, CHECKIN, and many more. Slide also includes client software and an extensive test suite, which will not be covered in this article.

Using the WebDAV protocol, Slide can be used as a standalone system that manages your content and metadata, much as a relational database system such as Oracle manages structured data. For this purpose you can simply download Slide and run it. Another option is to embed Slide in your software using its native Java API, which supports the same operations as the WebDAV layer. If the data you work with can be seen as content, Slide gives you a ready-made persistence layer that can even work without a relational database.

Slide's server architecture consists of a core responsible for locking, versioning, authentication and security, search, transactions and related tasks. Built on top of this is the WebDAV layer, which provides the additional HTTP methods defined in the WebDAV, ACL, DeltaV, and DASL standards. The core itself uses a highly flexible two-tiered persistence layer. Using configuration files, resources identified by hierarchical URIs are mapped to compound logical stores. These logical stores delegate different aspects of a resource to possibly different physical stores.

Having the storage organized like this may seem complicated, but it allows for some very useful configurations. You could, for example, store XML documents in a certain folder that maps to a dedicated XML database, while Word documents are stored in a different folder and go into the file system. Or you might store all content in the file system and all metadata in a relational database. As Slide already features physical store implementations for a transactional file system, relational databases and the Tamino XML database, all these configurations are actually possible with Slide.

This article will try to shed some light on how caching takes place in the first tier and how Slide's transactional file system works as a physical store in the second tier.


Transactional Caching

You probably already know that access to data on persistent storage tends to be slow. Because of Slide's cleanly separated tiers, the WebDAV and core tiers generally access the storage tier multiple times within a single request - sometimes asking for the same data over and over again. Without countermeasures, this would slow Slide down a great deal.

What is the solution? Caching. When you think of a cache, what features would you require of it? I think most of us can agree on at least these requirements:

  1. If something is put inside a cache, it is desirable to have it accessible as long as possible
  2. If an entry is actually found, it must be valid
  3. The cache must not grow beyond its well-defined boundaries
  4. The cache should never block
  5. The cache should be significantly faster than the entity it is caching

While these requirements seem clear at first glance and should not be too hard to meet, it is not that easy. What makes this especially difficult is Slide's transaction support: not only the stores, but also the cache in front of them needs to support transactions. Obviously, of the well-known ACID transaction properties, durability does not apply to caching, although caching must not prevent the underlying store from providing durability.

Seeing it this way, it should be clear that the validity of entries actually is an issue. There might be entries that have to be rolled back at the end of a transaction, and there are issues related to isolation. When a transaction writes an entry into the cache and later tries to read it again, it should get the same value, either from the cache or directly from the persistence layer. It is also desirable that changes made by one of several concurrent transactions remain invisible to the others. As speed is the motivation for introducing a cache, its own performance obviously is an issue as well. This also explains why it is highly desirable that it never block or wait for anything. Another constraint arises from software design: keep it simple, stupid.

Slide offers a partial solution to all the requirements of the ideal cache. Its implementation of a transactional cache is based on the LRUMap class from the Jakarta Commons Collections. This map determines the caching strategy: least-recently-used. This means when the cache is full, the entry that has not been used for the longest time will be removed from it. Apart from this map, which serves as the global cache all transactions share, there are also a number of caches that are private to each transaction. These caches are used to locally store information about changed, added or deleted entries until the transaction either commits or rolls back.
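To see the least-recently-used strategy in action, here is a minimal, self-contained sketch. It assumes the LRUMap of Jakarta Commons Collections 3.x, where the class lives in the org.apache.commons.collections.map package:

import org.apache.commons.collections.map.LRUMap;

public class LRUDemo {
    public static void main(String[] args) {
        LRUMap map = new LRUMap(2);  // bounded cache with room for two entries
        map.put("a", "1");
        map.put("b", "2");
        map.get("a");                // touch "a"; "b" is now least recently used
        map.put("c", "3");           // cache is full: "b" is evicted, "a" survives
        System.out.println(map.containsKey("b")); // prints false
    }
}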

There are two types of caches inside Slide: one for content that stores bytes and another one for storing metadata objects. As the one for content data is too complex and even weird in some respects, we will restrict ourselves to the metadata cache. As this cache makes no distinction between changed and added entries, the protected fields of the caching class look like this:


protected Map globalCache = new LRUMap(globalCacheSize); // shared by all transactions
protected Map txChangeCaches = new HashMap(); // txId -> map of changed or added entries
protected Map txDeleteCaches = new HashMap(); // txId -> set of keys deleted in the transaction


Now every time a transaction is started, a new HashMap to hold the modified entries and a new HashSet containing the keys of the deleted entries are added to the appropriate maps. This results in a map of maps and a map of sets, respectively. This is how the code to start a new transaction might look:


public synchronized void start(Object txId) {
    // create fresh per-transaction caches for changed and deleted entries
    txChangeCaches.put(txId, new HashMap());
    txDeleteCaches.put(txId, new HashSet());
}


Like almost all methods of the cache, it needs the transaction identifier in order to assign the action to the appropriate transaction. You can also see that the size of the locally cached changes is not limited. This obviously violates one of the properties you would expect of a cache. There certainly are ways to limit them in size as well, but this would endanger the keep it simple, stupid principle. As we handle metadata only, not content bytes, this restriction is one we can live with. Let's see what happens when we add an entry inside a transaction:


public synchronized void put(Object txId, Object key, Object value) {
    // store the entry in the transaction's private cache, not the global one
    Map changeCache = (Map) txChangeCaches.get(txId);
    changeCache.put(key, value);
}


This is simple and straightforward. Instead of storing the entry in the global cache, we add it to the temporary local one. We thus hide our change and defer the decision of what to do with it until later. The simplified code for retrieval inside this transaction looks like this:


public synchronized Object get(Object txId, Object key) {
    // local changes take precedence over the globally cached value
    Map changeCache = (Map) txChangeCaches.get(txId);
    Object changed = changeCache.get(key);
    if (changed != null) {
        return changed;
    } else {
        // fall back to the cache shared by all transactions
        return globalCache.get(key);
    }
}


As you can see, the local cache is checked first to reflect local changes; if there are none, the global cache is consulted. A return value of null indicates there is no valid entry and the data should be looked up from the database. Finally, when the transaction comes to an end, it will either be committed or rolled back. Committing means all local changes must be propagated to the global cache; rolling back discards all changes. Here is the code for commit:


public synchronized void commit(Object txId) {
    // propagate all local changes of this transaction to the global cache
    Map changeCache = (Map) txChangeCaches.get(txId);
    for (Iterator it = changeCache.entrySet().iterator(); it.hasNext();) {
        Map.Entry entry = (Map.Entry) it.next();
        globalCache.put(entry.getKey(), entry.getValue());
    }
}

We omit Slide's code for rollback, as it simply removes the local caches associated with the transaction.
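For completeness, here is a minimal sketch of what such a rollback could look like, following the description above; the method name and body are our illustration, not Slide's actual code:

public synchronized void rollback(Object txId) {
    // discard all local changes; the global cache was never touched
    txChangeCaches.remove(txId);
    txDeleteCaches.remove(txId);
}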

The real code is a bit more complicated, as it also handles deleted entries and does some hit counting and logging as well. However, the main principle should have become clear. You can also see it is pretty simple and obvious.

Too simple, in fact! The problem lies in what the get and commit methods do. Remember: on commit, all local changes are simply copied to the global cache. Now consider two transactions that run in parallel and access the same data entry. One reads the entry and the other one writes to it. The reading transaction accesses the global cache, as local copies are only created on write. Thus, when this transaction reads from the global cache before the writing one commits and again afterwards, it will read two different values. This is known as a Nonrepeatable Read.
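The following self-contained sketch makes the scenario concrete. It condenses the methods shown above into a hypothetical TxCache class - a plain HashMap stands in for the bounded LRUMap - and replays the interleaving just described:

import java.util.HashMap;
import java.util.Map;

public class NonrepeatableReadDemo {

    static class TxCache {
        Map globalCache = new HashMap();    // stands in for the bounded LRUMap
        Map txChangeCaches = new HashMap(); // txId -> local change map

        synchronized void start(Object txId) {
            txChangeCaches.put(txId, new HashMap());
        }
        synchronized void put(Object txId, Object key, Object value) {
            ((Map) txChangeCaches.get(txId)).put(key, value);
        }
        synchronized Object get(Object txId, Object key) {
            Object changed = ((Map) txChangeCaches.get(txId)).get(key);
            return (changed != null) ? changed : globalCache.get(key);
        }
        synchronized void commit(Object txId) {
            globalCache.putAll((Map) txChangeCaches.remove(txId));
        }
    }

    public static void main(String[] args) {
        TxCache cache = new TxCache();

        cache.start("tx0");                 // seed the global cache
        cache.put("tx0", "doc", "v1");
        cache.commit("tx0");

        cache.start("tx1");                 // reader
        cache.start("tx2");                 // concurrent writer

        System.out.println(cache.get("tx1", "doc")); // "v1" from the global cache

        cache.put("tx2", "doc", "v2");      // tx2 changes the entry locally
        cache.commit("tx2");                // ... and publishes it on commit

        // same transaction, same key, different value: a nonrepeatable read
        System.out.println(cache.get("tx1", "doc")); // "v2"
    }
}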

There is also a much more severe violation. Imagine two concurrent transactions that each read a value and then write a new value that depends on what they read. For example, both transactions read a document, append something to it, and write it back. With our caching strategy, one of the two updates to the document is likely to get lost. This is known as the Lost Update problem. What does our cache do about this? Simply nothing! That's why our cache is pretty bad in terms of isolation! However, this is not a problem in practice, as the scenario described above can never happen in Slide: a read would be one WebDAV request and a write would be another, and transactions simply are not allowed to span more than a single request. To ensure correctness across requests, WebDAV provides LOCK and UNLOCK methods that can lock a whole document for the duration of the update.

There are plans to add transactions spanning more than a single request to Slide. For this purpose, the existing caching is simply not sufficient. Because of this, a new caching system that avoids blocking locks and features a sophisticated spilling mechanism providing for serializable execution schedules is already in development.


Slide's Transactional File System

We now turn to Slide's transactional file system. Its purpose is to serve as Slide's default store implementation, enabling Slide to run without a relational database or in combination with one. Before we delve into its implementation details, let us recapitulate some issues we already discussed.

The caching used in Slide is pretty weak in terms of isolation. Why is this so? Mainly because it uses no locks, and locks are required to reliably prevent violations of isolation and correctness. While Slide's transactional file system uses a strategy of local changes very similar to the caching described above, it adds locks as an additional feature. Using these locks in a two-phase locking manner, it actually provides the isolation level "serializable".

What else do we have to consider when we want to have a transactional file system? Let's have a look at the ACID properties in detail:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Atomicity can be achieved by the mechanism of local copies introduced in the caching section. Isolation and Consistency are guaranteed by two-phase locking, as there are proofs showing that two-phase locking guarantees a serializable schedule. Thus Durability remains as our problem, right? You might say Durability is no problem, as it is already provided by the file system in the first place! That's true, but together with the Atomicity property it actually becomes a problem.

To see that there really is a problem, consider what the cache does when committing a transaction: it copies all locally modified entries to the global cache. Is this atomic? It is, as the whole commit is wrapped in a synchronized block. What if a power loss occurs while committing? Well, the worst case is that the cache is gone, which is no problem, as a cache is a volatile device! What if we took over this practice from caching to our file system? We would then move files from local temporary directories to their main location upon commit. Again, what about a power loss in the middle of doing this? In this case it would be a big problem, as after a restart only half of the files would have been copied. This would certainly violate Atomicity and most likely Consistency as well. What about the other way around: make modifications to the main files and keep backup copies in local temporary directories? We can do this, as we now have locks that can disallow access to those main files while a transaction works on them. Would this solve our problem? No, as in case of a rollback we would have to copy those backups back to the main directory, which again may fail in the middle.

Now that we have understood the problem, let us turn to the solution for locking and durability in Slide's transactional file system.


Locking

Slide's transactional file system knows four different locks, called ACCESS, SHARED, EXCLUSIVE and COMMIT. Let us concentrate on SHARED and EXCLUSIVE, as they are sufficient to implement correct two-phase locking - ACCESS and COMMIT are useful only when lower isolation levels are desired. Every time a transaction accesses a resource for reading, it attempts to acquire a shared lock. This shared lock can coexist with other shared locks on the resource, but not with an exclusive one. If an exclusive lock on the resource is already held by another transaction, our transaction will have to wait until that exclusive lock is released. The same is true the other way round: if a transaction tries to acquire an exclusive lock on a resource that is already locked shared, it has to wait as well. An exclusive lock is needed when a resource is accessed for writing. Of course, exclusive locks cannot coexist with other exclusive locks either.
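The compatibility rule boils down to a single check. The following fragment is our own illustration of it, not Slide's actual lock manager:

// SHARED/EXCLUSIVE compatibility as described above: only two shared
// locks may coexist on the same resource; everything else must wait.
enum LockLevel { SHARED, EXCLUSIVE }

static boolean compatible(LockLevel held, LockLevel requested) {
    return held == LockLevel.SHARED && requested == LockLevel.SHARED;
}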

And, well, basically that's it. The only open issue here is: what do we do if a lock can never be acquired? This may be the case either when a transaction "forgets" to release its locks because it never commits or rolls back, or when two transactions mutually wait for each other to release a lock. The first scenario is called a live-lock, the second a dead-lock. The first is live because the transactions could resolve the situation without help from outside: the loitering transaction could simply decide to commit or roll back. In the second scenario it is impossible for the transactions to do anything, as both are blocked mutually waiting on each other.

A solution for live-locks is timeouts. The best solution would be to roll back a transaction after a generous amount of time, which might even be adjusted by the transaction itself. For simplicity, our implementation has no active watchdog threads running and can only influence transactions when they make calls into the system. Unfortunately, if a transaction simply does nothing, there is no way to stop it and make it release its resources and locks. That's bad luck, but what we can do - and what we actually do - is roll back the transaction that is trying to acquire the lock. This would be of little use if there actually were loitering transactions in Slide. Well, this simply is not the case, as transactions cannot be accessed or influenced by WebDAV requests. As already mentioned in the caching section, there are plans to make transactions controllable over WebDAV. In that case, active timeouts would be indispensable.

If this way of resolving a live-lock is never used, why did we go to the trouble of explaining the mechanism? On the one hand, to show you a limitation of the implementation; on the other, because this mechanism is an easy way to resolve a dead-lock! The most elegant way would be to analyze which transaction waits on which other's locks each time a transaction tries to acquire a new lock. If you find a cycle in this dependency graph, you would have to roll back one of the transactions involved as a deadlock victim. Our implementation does not perform such an analysis, but simply assumes there is a deadlock after a certain amount of waiting. This certainly is an issue to be addressed and corrected in future releases.
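In code, the timeout heuristic might look like the following sketch; it uses java.util.concurrent locks for illustration, and acquireOrAbort and rollback are hypothetical names, not Slide's API:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Wait a bounded time for a lock; if it is not granted, assume a deadlock
// and roll back the requesting transaction - it becomes the victim.
void acquireOrAbort(ReentrantLock lock, Object txId, long timeoutMillis)
        throws InterruptedException {
    if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
        rollback(txId); // hypothetical: release this transaction's work and locks
        throw new IllegalStateException("possible deadlock, transaction rolled back");
    }
}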


Durability

To store all information about a transaction, a TransactionContext class has been introduced. Among other minor bookkeeping information, it encapsulates the following (a minimal sketch is given after the list):

  • The transaction identifier,
  • The state the transaction is in,
  • A list of all locks held by this transaction, and
  • A list of resources, i.e. streams, touched while this transaction is running.
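A minimal sketch of such a class might look like this; the field names are our own, and the real class carries more bookkeeping:

import java.util.ArrayList;
import java.util.List;

class TransactionContext {
    Object txId;                          // the transaction identifier
    int status;                           // e.g. STATUS_COMMITTING, STATUS_COMMITTED
    List locks = new ArrayList();         // all locks held by this transaction
    List openResources = new ArrayList(); // streams touched while it is running
}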

As a clue to how Slide's transactional file system solves the durability problem described above, let me tell you that some of the information given above is made persistent. The most important persistent information is the identifier, the state of the transaction and which files have been modified. Having this information persistent guarantees we can always recover the state of a transaction that might have been interrupted in the middle of ... whatever. To illustrate it, here is the simplified code for commit.


public void commitTransaction(Object txId) {
    // look up the context that holds all information about this transaction
    TransactionContext context = getContext(txId);

    // persist the fact that we are about to commit
    context.status = STATUS_COMMITTING;
    context.saveState();

    // move all modified files to their main location
    context.commit();

    // persist the fact that the commit completed
    context.status = STATUS_COMMITTED;
    context.saveState();

    // delete temporary files and directories
    context.cleanUp();
}


As in our code for caching, we take the transaction id and use it to look up the transaction context. Now we can see what the trick is. First, we set the state to committing and make this information persistent via saveState(). After that, we begin the commit process, which copies all modified files from our local temporary directory to their main location:


public synchronized void commit() {
    closeResources();      // close any streams the user forgot to close
    upgradeLockToCommit(); // now no other concurrent operation is allowed
    moveRecursive(new File(localChangeDir), new File(mainDir));
    freeLocks();           // strict two-phase locking: release only at the very end
}


First, all resources involved in our transaction are closed, in case the user forgot to close them. After that, all locks held by the transaction are upgraded to commit locks, which means no other concurrent operation is allowed. Then all modified files are moved to the main directory and finally all locks are freed. This illustrates the strict two-phase locking mechanism used here: you never release any locks or downgrade them, but only acquire or upgrade locks until you commit the whole transaction. Only at the end of the commit do you free all locks.

Getting back to the code in commitTransaction: After commit we set the state to committed and make this information persistent again. Finally, we do some clean up which mainly involves deleting temporary files and directories.

The most important part of the code omitted here deals with catching exceptions and, in case one occurs, putting the system into dirty mode, which requires a recovery run before further requests will be processed. This recovery run can be invoked manually, but is also done at system start. If anything really bad - a power loss, a virtual machine crash, an out-of-memory error, etc. - happens during commit, what can happen to our system? Is consistency endangered? There are four cases to consider:

  1. Before the commit process is started and STATUS_COMMITTING is written to disk
  2. After this, but before STATUS_COMMITTED is written
  3. After STATUS_COMMITTED is written
  4. After the commit process is ended

Obviously, cases (1) and (4) are no problem, as the transaction is either completely committed or not committed at all. Case (3) isn't that bad either, as the only thing missing is the cleanup of temporary files. There is no danger for consistency either, since the recovery mechanism will simply delete those files when it finds a transaction in committed state that still has files associated with it.

This leaves us with case (2). This really is a problem, as any number of files may already have been copied while others have not. In this case it is important to remember that no requests are handled by the system while it is in dirty or recovery mode. This means no temporarily inconsistent data can be accessed from outside. The good thing is that there is always enough information to resolve this temporarily inconsistent state, no matter where the commit failed. Even better, the recovery run is strictly single threaded and thus sequential, which means no concurrency issues have to be handled. That's why we do not need the locking information, which has not been made persistent.

When the recovery process finds a transaction that was in the middle of being committed, it rolls it forward, which means it copies all files not yet copied in order to reach a consistent state. This is possible because every file that has already been copied is deleted from the temporary directory afterwards. After the transaction has been rolled forward, it is no longer a threat to consistency, so it can be deleted and the dirty mode can be reset.
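As a sketch, the roll-forward during recovery might look like the following fragment; all method and field names are our own illustration, not Slide's actual recovery code:

// Recovery is strictly single threaded, so no locking is needed here.
void recover() {
    for (Iterator it = loadPersistedContexts().iterator(); it.hasNext();) {
        TransactionContext context = (TransactionContext) it.next();
        if (context.status == STATUS_COMMITTING) {
            // files already moved were removed from the local directory,
            // so this copies exactly the ones that are still missing
            moveRecursive(new File(context.localChangeDir), new File(mainDir));
        }
        // committed or rolled forward: only temporary leftovers remain
        context.cleanUp();
    }
    clearDirtyFlag(); // leave dirty mode; requests are processed again
}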


Conclusion

With its caching and file system store, Slide features two examples of transactional systems. We have seen how they work and what their limitations and drawbacks are. Overall, they do a pretty good job compared to what Slide's applications require. As Slide is extended further, widening the range of possible applications, at the very least the caching might have to be improved to provide better isolation. Other planned extensions include improvements to the scalability of the search engine, better versioning support, general performance enhancements, events and clustering, external transactions, and further ports of the database layer. The Slide project is constantly looking for people interested in one of these areas who are willing to contribute or become committers.


About the Author

Oliver Zeigermann lives in Hamburg, Germany and works for C1 Financial Services GmbH. His favorite technical topics range from parsing and compiler design to transactional, concurrent and database systems. He is the release manager for the 2.0 release of the Jakarta Slide project. Oliver can be reached at oliver@zeigermann.de.
