Back in February of this year when I was an employee of Sleepycat Software, I posted this note to TSS asking for feedback on a new POJO persistence API for Berkeley DB Java Edition:
https://www.theserverside.com/news/thread.tss?thread_id=38916
Since then, I and the rest of us at Sleepycat have become part of Oracle, we have released a beta version of the persistence API, and last week we released the final version. Along the way, the name changed from the Persistence API to the Direct Persistence Layer (DPL). We wanted to distinguish our API clearly from Sun's Java Persistence API (the JPA), since the DPL and the JPA are not identical.
The announcement and website links for the new release of Oracle Berkeley DB Java Edition are here:
http://forums.oracle.com/forums/ann.jspa?annID=321
http://www.oracle.com/database/berkeley-db/je/index.html
http://www.oracle.com/technology/documentation/berkeley-db/je/index.html
What I would like to talk about now is:
* What is Berkeley DB Java Edition and why is there a need for it in the world?
* How is the DPL different from the JPA and why is it different?
To me, all of this is about different branches of evolution. When amphibians started out of the water, a choice was made. From then on evolution led to the best adaptations possible for living in the out-of-water environment. Put the supremely adapted amphibian or land creature back in the open ocean and it won't do very well.
For the software architect, the initial big choices determine the environment (space) the product intends to occupy. Sometimes a product will mutate and occupy a completely different space, but in general products either adapt very well and become the best in their space or they die out.
So once the product space is determined, the goal for us product developers is to adapt and improve our product to make it the best available in that space. For application developers one of the challenges is to pick the right tools, which means determining the type of product that is appropriate (what space to look in) and then choosing the best product in that space.
Oracle Berkeley DB Java Edition is a high performance, transactional B-tree database. It is a very different animal than a relational database. It is in a different product space.
A relational database stack looks something like this:
========================
| Java Persistence API |
------------------------
| JDBC |
------------------------
| Communications Layer |
========================
||
========================
| Communications Layer |
------------------------
| Query Processor |
------------------------
| Storage Manager |
========================
Or in the case of a relational database that is running in the same JVM as the application, as is possible with Derby:
========================
| Java Persistence API |
------------------------
| JDBC |
------------------------
| Query Processor |
------------------------
| Storage Manager |
========================
If an application requires ad-hoc queries and a general purpose database server, then using a relational database is a good choice because it is very well adapted for those needs. In particular, the query processor has evolved over many years and with great effort to provide a logical, standardized view of data while at the same time providing very good performance.
Berkeley DB, on the other hand, looks like this:
===================
| Java API |
-------------------
| Storage Manager |
===================
There is no communications layer because Berkeley DB Java Edition is a library (packaged as a simple jar file) embedded in your application. There is no JDBC layer or query processor because Berkeley DB doesn't have built-in support for ad-hoc queries.
Interestingly, the storage manager is the only component that is in common. In both cases, Btree or similar technology is used to provide storage and indexing of data with concurrency and transactions built in at that level. Storage and indexing technology are highly evolved.
Is a relational database better for some storage needs? Absolutely. If you need a standalone database server, a built-in query language and support for standard APIs, then a relational database is usually the best choice.
Is a relational database the best choice for all storage needs? Absolutely not. If you need to persist Java objects, maximize performance, and reduce the complexity (number of moving parts) in your application environment, then Berkeley DB Java Edition is often the best choice.
By another measure, if your application doesn't really use SQL outside of the ORM solution you're using then Berkeley DB Java Edition and the DPL may be a good choice for you.
Berkeley DB has evolved along a different branch than relational databases have. While the storage management technology may be very similar, with an embedded Btree database library several things are different.
First, an embedded database is designed to work alongside the application, running in the same JVM. It must use only the configured amount of memory, allow many concurrent application threads to access data, and support transactional operations that are under the full control of the application.
Second, a Btree database library should have an API that provides direct access to the storage manager with no compromises in performance. We can imagine that the storage managers of relational databases have such internal APIs, but an embedded database must have a public API that is easy to use and appropriate for the task at hand.
And what is the task at hand for an application using an embedded database? As you can probably guess there is more than one general category of use cases. It is for this reason that Berkeley DB Java Edition has more than one API.
The base API of Berkeley DB Java Edition is intended for use cases where the schema is not tied down to a set of predefined Java classes. For example, in an implementation of an LDAP directory server or a JavaSpace, the schema is user defined. (BTW, both of these applications have been implemented using Berkeley DB Java Edition.) For such applications, our base API provides a fairly low level byte array interface for storing data.
But for many applications, Java classes are used to define the schema, where a Java class is defined to represent each stored entity. The Direct Persistence Layer is provided for such applications.
The DPL is literally a layer on top of the byte array base API. It is higher level in the sense that it takes care of marshaling Java objects to byte arrays and provides a type safe interface based on the Java classes that make up the schema.
But the DPL is also "direct" in that it doesn't add unnecessary overhead. This is where there is a clear difference in intent between the DPL and the JPA. If the DPL provided the same abstractions that the JPA provides for relational databases, it would not be doing its job of providing direct access to the Btree without compromising performance. The result would be something like a fish with legs; it would conform to a standard, but it would not swim as fast.
To understand Berkeley DB and the DPL, it is important to think in terms of a Btree. A Btree is very similar to a java.util.Map. It contains key-value pairs and provides many of the same operations as a Map:
- put a key-value pair
- get a value by key
- iterate over keys and/or values
- perform operations within a range of keys
The difference is that a Btree is a persistent, highly concurrent, transactional data structure.
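To make the Map analogy concrete, here is a small sketch that uses java.util.TreeMap as an in-memory stand-in for a Btree. It is not Berkeley DB code; the class and method names are made up for illustration, and the persistence, concurrency and transaction aspects are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: a sorted map offers the same four operations as a Btree.
class SortedMapAsBtree {
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> tree = new TreeMap<>();
        tree.put(30, "thirty");   // put a key-value pair
        tree.put(10, "ten");
        tree.put(20, "twenty");
        tree.put(40, "forty");
        return tree;
    }

    // Values whose keys fall in [from, to), returned in key order.
    static List<String> range(TreeMap<Integer, String> tree, int from, int to) {
        return new ArrayList<>(tree.subMap(from, true, to, false).values());
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> tree = sample();
        System.out.println(tree.get(20));                    // get a value by key
        for (Map.Entry<Integer, String> e : tree.entrySet()) {
            // iteration visits keys in sorted order, regardless of insertion order
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        System.out.println(range(tree, 10, 30));             // work within a key range
    }
}
```

The point of the sketch is only that sorted-key access and range operations come for free from the data structure itself.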
In addition, a Btree can be a primary index responsible for storing entities as well as indexing them by primary key, or a secondary index responsible for indexing the entities stored in a primary index by another (secondary) key.
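The relationship between a primary and a secondary index can be sketched with two sorted maps. This is a toy in-memory model under my own assumptions (the class name, the methods, and the use of a plain String as the "entity" are all hypothetical), not how Berkeley DB is implemented. It only shows that the secondary index holds keys pointing back at the primary index, which is where the entities actually live.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of one primary index and one secondary index.
class TwoIndexSketch {
    // Primary index: primary key -> the stored entity (here just a name).
    private final TreeMap<Long, String> nameById = new TreeMap<>();
    // Secondary index: secondary key -> primary keys of the matching entities.
    private final TreeMap<String, TreeSet<Long>> idsByName = new TreeMap<>();

    void put(long id, String name) {
        String old = nameById.put(id, name);
        if (old != null) {
            // Drop the stale secondary entry so both indexes stay in step.
            TreeSet<Long> ids = idsByName.get(old);
            ids.remove(id);
            if (ids.isEmpty()) {
                idsByName.remove(old);
            }
        }
        idsByName.computeIfAbsent(name, k -> new TreeSet<>()).add(id);
    }

    String getById(long id) {
        return nameById.get(id);
    }

    // A secondary lookup yields primary keys; the data comes from the primary index.
    List<Long> idsForName(String name) {
        TreeSet<Long> ids = idsByName.get(name);
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        TwoIndexSketch store = new TwoIndexSketch();
        store.put(1L, "Sally");
        store.put(2L, "Harry");
        store.put(3L, "Sally");
        System.out.println(store.idsForName("Sally")); // both entities named Sally
    }
}
```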
This brings us to how Java classes are annotated for use with the DPL, and how this differs from the JPA. Below is a class annotated for use with the DPL.
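import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;
import com.sleepycat.persist.model.SecondaryKey;
import static com.sleepycat.persist.model.Relationship.MANY_TO_ONE;

@Entity
class Person {

    @PrimaryKey(sequence="ID")
    private long id;

    @SecondaryKey(relate=MANY_TO_ONE)
    private String name;

    private int age;

    public Person(String name) {
        this.name = name;
    }

    private Person() {} // needed for deserialization
}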
As would be done using the JPA, the class is annotated as an @Entity. This means roughly the same thing for both the JPA and the DPL. An entity is an object that is stored and retrieved separately and has a unique primary key. But instead of @Id in the JPA, the DPL uses @PrimaryKey.
The @SecondaryKey DPL annotation is used in place of the JPA relationship annotations @OneToOne, @ManyToOne, etc. Furthermore, @SecondaryKey causes creation of a secondary index, while @OneToOne, @ManyToOne, etc, do not.
Why the differences? Because when you perform a query with Berkeley DB, you don't use a high level query language that moves the choice of using indexes or not into a query optimizer. With Berkeley DB, queries are performed by accessing Btree indexes directly.
In fact, there is no way to access data except by using an index. As shown below, the primary and secondary keys defined above correspond directly to primary and secondary indexes.
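PrimaryIndex<Long, Person> personById =
    store.getPrimaryIndex(Long.class, Person.class);
SecondaryIndex<String, Long, Person> personByName =
    store.getSecondaryIndex(personById, String.class, "name");

// put() returns the previously stored entity (null on insert), so keep our own reference.
Person person = new Person("Sally");
personById.put(person);  // assigns person.id from the "ID" sequence
person = personById.get(person.id);
person = personByName.get("Sally");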
This direct correspondence between the annotations for an entity and the indexes used to access that entity makes use of the DPL very straightforward.
If you are accustomed to using SQL, you may not be comfortable with the idea of using indexes to perform queries. This means that you write the queries procedurally, using Java. For example, the code below queries Person entities whose name starts with "H" and whose age is under 90, with the results ordered by name (the true and false arguments specify whether each end of the key range is inclusive):
EntityCursor<Person> people =
    personByName.entities("H", true, "I", false);
try {
    for (Person person : people) {
        if (person.age < 90) {
            // do something
        }
    }
} finally {
    people.close();
}
This is one of the things that makes Berkeley DB a fish and not a salamander. You have to optimize the query yourself by using indexes appropriately. But you get the benefits of maximum performance, complete control, and a very simple API and environment to work with.
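If the key range in that query looks odd, the same idea can be tried out with the standard collections. The sketch below is not DPL code: it uses a TreeMap keyed by name (so it assumes names are unique, which the DPL example does not), and the class and method names are hypothetical. It shows why the range from "H" inclusive to "I" exclusive selects exactly the names starting with an uppercase "H", and how the remaining filter is applied in plain Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of the "name starts with H, age under 90" query.
class NameRangeQuery {
    static List<String> namesStartingWithHUnder90(TreeMap<String, Integer> ageByName) {
        List<String> result = new ArrayList<>();
        // The index does the range work: every key in ["H", "I") begins with 'H'.
        for (Map.Entry<String, Integer> e
                : ageByName.subMap("H", true, "I", false).entrySet()) {
            // The non-indexed condition is an ordinary Java filter.
            if (e.getValue() < 90) {
                result.add(e.getKey());
            }
        }
        return result; // already ordered by name, because the map is sorted by key
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> ageByName = new TreeMap<>();
        ageByName.put("Hugo", 12);
        ageByName.put("Gina", 20);
        ageByName.put("Helga", 93);
        ageByName.put("Ian", 30);
        ageByName.put("Hank", 45);
        System.out.println(namesStartingWithHUnder90(ageByName)); // prints [Hank, Hugo]
    }
}
```

Note that the comparison is case sensitive, so a name such as "hal" falls outside this range.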
If you think about executing a query using Java, as shown above, you may be concerned that un-marshaling the objects in order to filter on a property, 'age' in this example, will be slower than using a query language. In fact, the DPL was designed with this use case in mind. Marshaling and un-marshaling have been made extremely fast by using generated bytecode rather than Java reflection.
In addition to using indexes to access data as shown above, you can optionally use the standard Java collections framework to access data. For each index, a standard java.util.Map object can be obtained.
I'd like to show one more annotation feature supported by the DPL that is also supported by relational databases:
@Entity
class Person {
    ...
    @SecondaryKey(relate=MANY_TO_ONE, relatedEntity=Employer.class)
    private long employerId;
    ...
}
When you define a @SecondaryKey you can specify that it is a foreign key with the relatedEntity property. This causes foreign key constraints to be enforced, just like in a relational database. You can also specify an "on delete" action as you can with a relational database.
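For readers who have not worked with foreign key constraints outside of SQL, here is a rough in-memory sketch of what "enforced" means. It does not use the DPL and says nothing about how the DPL implements the constraint; the class, its methods, and the choice to refuse the delete are all my own illustrative assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy store that enforces a Person -> Employer foreign key.
class ForeignKeySketch {
    private final Set<Long> employerIds = new HashSet<>();
    private final Map<Long, Long> employerIdByPersonId = new HashMap<>();

    void putEmployer(long employerId) {
        employerIds.add(employerId);
    }

    // Reject a person whose foreign key names no stored employer.
    void putPerson(long personId, long employerId) {
        if (!employerIds.contains(employerId)) {
            throw new IllegalStateException("no employer with id " + employerId);
        }
        employerIdByPersonId.put(personId, employerId);
    }

    // One possible "on delete" policy: refuse while any person still refers to the employer.
    void deleteEmployer(long employerId) {
        if (employerIdByPersonId.containsValue(employerId)) {
            throw new IllegalStateException("employer " + employerId + " is still referenced");
        }
        employerIds.remove(employerId);
    }

    boolean hasEmployer(long employerId) {
        return employerIds.contains(employerId);
    }

    public static void main(String[] args) {
        ForeignKeySketch store = new ForeignKeySketch();
        store.putEmployer(7L);
        store.putPerson(1L, 7L);       // accepted: employer 7 exists
        try {
            store.putPerson(2L, 99L);  // rejected: employer 99 does not exist
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```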
So with Berkeley DB in some ways you are working at a lower level, particularly when performing complex queries. But on the other hand many of the high level features in a relational database are available, such as foreign key constraints.
In addition there are several features of the DPL that are possible because there is no relational database in the picture and because Berkeley DB Java Edition is implemented natively using Java:
+ Arbitrary Java types may be used, such as enums, arrays, collections and embedded objects. These are not stored in a separate table or index; they are marshaled as part of the entity object's value.
+ For a given stored entity, references in the object graph at the time the object is stored are preserved when the entity is later retrieved. You may be familiar with this feature if you have used the built-in Java object serialization.
+ Entity classes may have subclasses and superclasses, as may embedded object classes, and polymorphism works as expected in Java. For example, a secondary key may be defined on an entity subclass and an index for that key will contain only instances of that subclass.
+ Class evolution is fully supported. Fields and classes may be added, removed, renamed, or converted by a custom conversion method. Evolution of instance data is performed lazily as it is read, or eagerly using an explicit conversion method.
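The second point in the list, about references in the object graph being preserved, may be easier to picture with the built-in Java serialization that it is compared to. The sketch below uses only java.io, not the DPL, and its class names are hypothetical. It shows two fields that refer to the same object still referring to one shared object after a round trip through bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo of shared references surviving a store/retrieve round trip.
class SharedReferenceDemo {
    static class Address implements Serializable {
        private static final long serialVersionUID = 1L;
        String city = "Berkeley";
    }

    static class Household implements Serializable {
        private static final long serialVersionUID = 1L;
        Address home = new Address();
        Address mailing = home; // deliberately the same object as 'home'
    }

    // Serialize to a byte array and read the object graph back.
    static Household roundTrip(Household household) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(household);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Household) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if 'home' and 'mailing' are still one object after the round trip.
    static boolean sharingSurvives() {
        Household copy = roundTrip(new Household());
        return copy.home == copy.mailing;
    }

    public static void main(String[] args) {
        System.out.println(sharingSurvives());
    }
}
```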
The last feature I'd like to point out is that the use of annotations is optional. If you have another source of metadata (primary and secondary key information) you can implement your own EntityModel to supply your metadata to the DPL. The metadata could come from anywhere you choose -- it could be loaded from XML files, derived by using naming conventions, etc. This takes POJO persistence one step further: without the use of DPL annotations, the DPL packages do not even need to be present to compile your persistent classes or load them at runtime.
Hopefully I have made the differences between the DPL and the JPA clear and have made a case for why these differences exist. Please post any questions or comments you have. If you would like to try out the Berkeley DB Java Edition product, please go ahead and download it:
http://www.oracle.com/database/berkeley-db/index.html
If you have usage questions or comments, the product discussion forum is a good place to start:
http://forums.oracle.com/forums/forum.jspa?forumID=273