Enterprise Java Community: A Scalable, Transactional Data Store for Web 2.0 Services

Global online services such as Amazon, eBay, Myspace, YouTube, and Google serve millions of customers through tens of thousands of servers located world-wide. On this immense scale, components fail continuously and it is very difficult to maintain a consistent state while at the same time hiding failures from the application.

Global online services such as Amazon, eBay, Myspace, YouTube, and Google serve millions of customers through tens of thousands of servers located world-wide. On this immense scale, components fail continuously and it is very difficult to maintain a consistent state while at the same time hiding failures from the application.

Peer-to-peer protocols achieve self-management by replicating services among peers, but these are mostly limited to write-once/read-many data sharing. To extend them beyond typical file sharing, the support of fast transactions on distributed hash tables (DHTs) is an important, and until now, elusive piece of functionality.

The Scalaris system, described below, provides a comprehensive solution for self managing, scalable data management. Scalaris and similar systems may be an important core service of future cloud computing environments.

As a common key aspect, all Web 2.0 services have to deal with concurrent data updates. Typical examples are checking the availability of products and their prices, purchasing items and putting them into virtual shopping carts, and updating the state in multi-player online games. Clearly, many of these data operations have to be atomic, consistent, isolated and durable (ACID). Traditional centralized database systems are ill-suited for this task, sooner or later they become a bottleneck for business workflow. Rather, a scalable, transactional data store like Scalaris is what is needed.

Scalaris Key/Value Store

We set out to build a distributed key/value store capable of serving thousands or even millions of concurrent data accesses per second. Providing strong data consistency in the face of node crashes and hefty concurrent write accesses was one of our major goals.

With the Scalaris system, we do not attempt to replace current database management systems with their general, full-fledged SQL interfaces. Instead our target is to support transactional Web 2.0 services like those needed for Internet shopping, banking, or multi-player online games. Our system consists of three layers:

  • At the bottom, an enhanced structured overlay network, with logarithmic routing performance, provides the basis for storing and retrieving keys and their corresponding values. In contrast to many other overlays, our implementation stores the keys in lexicographical order. Lexicographical ordering instead of random hashing enables control of data placement which is necessary for low latency access in multi-datacenter environments.
  • The middle layer implements data replication. It enhances the availability of data even under harsh conditions such as node crashes and physical network failures.
  • The top layer provides transactional support for strong data consistency in the face of concurrent data operations. It uses a fast consensus protocol with low communication overhead that has been optimally embedded into the structured overlay.

These three layers are all implemented in Erlang. Together, they provide a distributed key/value store as a scalable and highly available service which is an important building block for Web 2.0 applications.

Why use Erlang?

Erlang is a functional programming language. It provides only write-once variables, and is therefore often cited as the ideal programming language for implementing parallel algorithms. However, Erlang is more than simply a great programming language – it also makes it very easy to convert parallel programs into distributed ones, because the concurrency model is based on message passing instead of shared state. By distributing the individual threads or processes over several nodes, a parallel program becomes a distributed one.

Erlang's message passing style fits the programming abstractions used in research on distributed systems. This enabled us to directly translate our abstract algorithms for transactions and the P2P layer into Erlang code. The whole transaction framework comprises only about 2,000 lines of code.

And Erlang's asynchronous message passing style is not the whole story either. Another powerful feature is its standard library including OTP (Open Telecom Platform) which provides many useful abstractions for coping with failures. One example is the supervisors which are used to monitor and restart processes in case of node crashes.

Transaction API for Java and Erlang

The following Java code snippet illustrates the common bank account example where money is transferred from account A to B. The money transfer is bracketed within a transaction, thereby ensuring atomicity and isolation from other transactions.

   // new Transaction object
   Transaction transaction = new Transaction();

   // start new transaction

   // read account A
   int account_A = new Integer(transaction.read("account_A")).intValue();
   // read account B
   int account_B = new Integer(transaction.read("account_B")).intValue();

   // remove 100$ from account A
   transaction.write("account_A", new Integer(account_A - 100).toString());
   // add 100$ to account B
   transaction.write("account_B", new Integer(account_B + 100).toString());

   // commit

The corresponding Erlang code is slightly more complex, because of the explicit handling of transaction states. The Erlang example below performs the same money transfer as the Java code.

TFun =
  fun(TransLog) ->
    % read balance
    {MyBalance, TransLog1}    = read2(TransLog,  MyAccount),
    {OtherBalance, TransLog2} = read2(TransLog1, OtherAccount),
    % update balance
    TransLog3 = write2(TransLog2, MyAccount,    MyBalance - 100),
    TransLog4 = write2(TransLog3, OtherAccount, OtherBalance + 100),
    {ok, TransLog4}

SuccessFun = fun(X) ->
       {success, X}
FailureFun = fun(Reason) ->
       {failure, Reason}

% execute transaction
do_transaction(TFun, SuccessFun, FailureFun).

The Erlang transaction API is more powerful and it exposes all functionality, including the asynchronous execution. Transactions are expressed as anonymous functions, i.e. pointers to functions like TFun. TFun is a function, which, when executed, records all reads and writes in a transaction log.

The function do_transaction first calls TFun to gather the transaction log (read phase) and then tries to atomically commit the recorded changes to the system (commit phase). If a concurrent write operation on one of the involved items is detected, the transaction is aborted and the user defined FailureFun is executed. Otherwise the given SuccessFun is called.

The Java API, in contrast, exposes only a subset of the functionality but is more convenient to use.

Demo Application

As a demonstrator application we implemented a subset of the Wikipedia functionality with Scalaris as the database. The user-facing webservers are standard Java Servlet containers which implement the application logic and render the WikiText to HTML.

The database backend implements a thin layer for mapping the Wikipedia SQL tables to our data model. The Scalaris data model is essentially equivalent to a Map<String,String> with support for range queries. For Wikipedia we had to split its relational scheme into key-value pairs.


We tested the performance of Scalaris on an Intel cluster. Each node has two Quad-Core E5420s running at 2.5 GHz and 16GM of main memory. On each physical cluster node, we ran one Erlang virtual machine and 16 Scalaris nodes. The graphs show the performance of modify operations (top graph) and read operations (bottom graph) with a replication degree of four. The read operation reads a majority of the replicas of one key-value pair while the modify operation performs a read-modify-write cycle on a key-value pair in a transaction.

The graphs show the number of threads executing the benchmark per node and the number of cluster nodes used. Note, that for the 100-thread-case, there are actually up to 16*100 threads issuing modify transactions concurrently. The read as well as the modify scale almost linearly with the number of nodes. To achieve the best performance, more concurrent data accesses are needed in the modify operation compared to the read operation.

Additional Information

The Scalaris code is open source. It is available at https://code.google.com/p/scalaris/ . Additional information (papers, videos) can be found at http://www.onscale.de .

Dig Deeper on Java Web services

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.