Oracle Berkeley DB Java Edition 3.1.0: Direct Persistence Layer

  1. Back in February of this year when I was an employee of Sleepycat Software, I posted this note to TSS asking for feedback on a new POJO persistence API for Berkeley DB Java Edition:

    http://www.theserverside.com/news/thread.tss?thread_id=38916

    Since then, I and the rest of us at Sleepycat have become part of Oracle, we have released a beta version of the persistence API, and last week we released the final version. Along the way, the name changed from the Persistence API to the Direct Persistence Layer (DPL). We wanted to distinguish our API clearly from Sun's Java Persistence API (the JPA), since the DPL and the JPA are not identical.

    The announcement and website links for the new release of Oracle Berkeley DB Java Edition are here:

    http://forums.oracle.com/forums/ann.jspa?annID=321
    http://www.oracle.com/database/berkeley-db/je/index.html
    http://www.oracle.com/technology/documentation/berkeley-db/je/index.html

    What I would like to talk about now is:

    * What is Berkeley DB Java Edition and why is there a need for it in the world?
    * How is the DPL different from the JPA and why is it different?

    To me, all of this is about different branches of evolution. When amphibians started out of the water, a choice was made. From then on evolution led to the best adaptations possible for living in the out-of-water environment. Put the supremely adapted amphibian or land creature back in the open ocean and it won't do very well.

    For the software architect, the initial big choices determine the environment (space) the product intends to occupy. Sometimes a product will mutate and occupy a completely different space, but in general products either adapt very well and become the best in their space or they die out. So once the product space is determined, the goal for us product developers is to adapt and improve our product to make it the best available in that space.
    For application developers one of the challenges is to pick the right tools, which means determining the type of product that is appropriate (what space to look in) and then choosing the best product in that space.

    Oracle Berkeley DB Java Edition is a high performance, transactional B-tree database. It is a very different animal than a relational database. It is in a different product space. A relational database stack looks something like this:

        ========================
        | Java Persistence API |
        ------------------------
        |         JDBC         |
        ------------------------
        | Communications Layer |
        ========================
                   ||
        ========================
        | Communications Layer |
        ------------------------
        |   Query Processor    |
        ------------------------
        |   Storage Manager    |
        ========================

    Or in the case of a relational database that is running in the same JVM as the application, as is possible with Derby:

        ========================
        | Java Persistence API |
        ------------------------
        |         JDBC         |
        ------------------------
        |   Query Processor    |
        ------------------------
        |   Storage Manager    |
        ========================

    If an application requires ad-hoc queries and a general purpose database server, then using a relational database is a good choice because it is very well adapted for those needs. In particular, the query processor has evolved over many years and with great effort to provide a logical, standardized view of data while at the same time providing very good performance.

    Berkeley DB, on the other hand, looks like this:

        ===================
        |    Java API     |
        -------------------
        | Storage Manager |
        ===================

    There is no communications layer because Berkeley DB Java Edition is a library (packaged as a simple jar file) embedded in your application. There is no JDBC layer or query processor because Berkeley DB doesn't have built-in support for ad-hoc queries.

    Interestingly, the storage manager is the only component that is in common.
    In both cases, Btree or similar technology is used to provide storage and indexing of data with concurrency and transactions built in at that level. Storage and indexing technology are highly evolved.

    Is a relational database better for some storage needs? Absolutely. If you need a standalone database server, a built-in query language and support for standard APIs, then a relational database is usually the best choice.

    Is a relational database the best choice for all storage needs? Absolutely not. If you need to persist Java objects, maximize performance, and reduce the complexity (number of moving parts) in your application environment, then Berkeley DB Java Edition is often the best choice. By another measure, if your application doesn't really use SQL outside of the ORM solution you're using, then Berkeley DB Java Edition and the DPL may be a good choice for you.

    Berkeley DB has evolved along a different branch than relational databases have. While the storage management technology may be very similar, with an embedded Btree database library several things are different.

    First, an embedded database is designed to work alongside the application, running in the same JVM. It must use only the configured amount of memory, allow many concurrent application threads to access data, and support transactional operations that are under the full control of the application.

    Second, a Btree database library should have an API that provides direct access to the storage manager with no compromises in performance. We can imagine that the storage managers of relational databases have such internal APIs, but an embedded database must have a public API that is easy to use and appropriate for the task at hand.

    And what is the task at hand for an application using an embedded database? As you can probably guess there is more than one general category of use cases. It is for this reason that Berkeley DB Java Edition has more than one API.

    The base API of Berkeley DB Java Edition is intended for use cases where the schema is not tied down to a set of predefined Java classes. For example, in an implementation of an LDAP directory server or a JavaSpace, the schema is user defined. (BTW, both of these applications have been implemented using Berkeley DB Java Edition.) For such applications, our base API provides a fairly low level byte array interface for storing data.

    But for many applications, Java classes are used to define the schema, where a Java class is defined to represent each stored entity. The Direct Persistence Layer is provided for such applications.

    The DPL is literally a layer on top of the byte array base API. It is higher level in the sense that it takes care of marshaling Java objects to byte arrays and provides a type safe interface based on the Java classes that make up the schema. But the DPL is also "direct" in that it doesn't add unnecessary overhead.

    This is where there is a clear difference in intent between the DPL and the JPA. If the DPL provided the same abstractions that the JPA provides for relational databases, it would not be doing its job of providing direct access to the Btree without compromising performance. The result would be something like a fish with legs; it would conform to a standard, but it would not swim as fast.

    To understand Berkeley DB and the DPL, it is important to think in terms of a Btree. A Btree is very similar to a java.util.Map. It contains key-value pairs and provides many of the same operations as a Map:

    - put a key-value pair
    - get a value by key
    - iterate over keys and/or values
    - perform operations within a range of keys

    The difference is that a Btree is a persistent, highly concurrent, transactional data structure. In addition, a Btree can be a primary index responsible for storing entities as well as indexing them by primary key, or a secondary index responsible for indexing the entities stored in a primary index by another (secondary) key.
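
    To make the Map analogy concrete, here is a minimal sketch that uses only the standard library. It is not Berkeley DB code: a java.util.TreeMap stands in for the Btree, and the class name, method names and sample data are made up for illustration. It shows the four operations listed above, without any of the persistence, concurrency or transactions that a real Btree adds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: an in-memory sorted map standing in for a Btree.
// Nothing here is persistent or transactional.
public class BtreeAsMapSketch {

    /** Builds a small sorted map of name -> age (hypothetical sample data). */
    public static NavigableMap<String, Integer> sampleIndex() {
        NavigableMap<String, Integer> index = new TreeMap<>();
        index.put("Sally", 42); // put a key-value pair
        index.put("Harry", 35);
        index.put("Helen", 91);
        index.put("Ian", 28);
        return index;
    }

    /** Returns the keys in the range [from, to), in key order. */
    public static List<String> keysInRange(NavigableMap<String, Integer> index,
                                           String from, String to) {
        // subMap gives a view restricted to the key range:
        // 'from' is inclusive, 'to' is exclusive.
        return new ArrayList<>(index.subMap(from, true, to, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> index = sampleIndex();

        // Get a value by key.
        System.out.println("Sally -> " + index.get("Sally"));

        // Iterate over keys and values; a sorted map visits them in key order.
        for (Map.Entry<String, Integer> entry : index.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }

        // Perform an operation within a range of keys: names starting with "H".
        System.out.println(keysInRange(index, "H", "I"));
    }
}
```

    The range call takes the same shape of arguments (from key, inclusive flag, to key, inclusive flag) as the index range queries discussed in this post, which is part of why the Map comparison is a useful mental model.
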
    This brings us to how Java classes are annotated for use with the DPL, and how this differs from the JPA. Below is a class annotated for use with the DPL.

        @Entity
        class Person {

            @PrimaryKey(sequence="ID")
            private long id;

            @SecondaryKey(relate=MANY_TO_ONE)
            private String name;

            private int age;

            public Person(String name) {
                this.name = name;
            }

            private Person() {} // needed for deserialization
        }

    As would be done using the JPA, the class is annotated as an @Entity. This means roughly the same thing for both the JPA and the DPL. An entity is an object that is stored and retrieved separately and has a unique primary key.

    But instead of @Id in the JPA, the DPL uses @PrimaryKey. The @SecondaryKey DPL annotation is used in place of the JPA relationship annotations @OneToOne, @ManyToOne, etc. Furthermore, @SecondaryKey causes creation of a secondary index, while @OneToOne, @ManyToOne, etc, do not.

    Why the differences? Because when you perform a query with Berkeley DB, you don't use a high level query language that moves the choice of using indexes or not into a query optimizer. With Berkeley DB, queries are performed by accessing Btree indexes directly. In fact, there is no way to access data except by using an index.

    As shown below, the primary and secondary keys defined above correspond directly to primary and secondary indexes.

        PrimaryIndex<Long,Person> personById =
            store.getPrimaryIndex(Long.class, Person.class);

        SecondaryIndex<String,Long,Person> personByName =
            store.getSecondaryIndex(personById, String.class, "name");

        Person person = new Person("Sally");
        personById.put(person); // assigns the next "ID" sequence value to person.id
        person = personById.get(person.id);
        person = personByName.get("Sally");

    This direct correspondence between the annotations for an entity and the indexes used to access that entity makes use of the DPL very straightforward.

    If you are accustomed to using SQL, you may not be comfortable with the idea of using indexes to perform queries. This means that you write the queries procedurally, using Java.
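
    For readers who want to see the primary/secondary relationship without the Berkeley DB jar on the classpath, here is a rough plain-Java sketch. It is an analogy under stated assumptions, not the DPL implementation: the IndexSketch class, its methods and the counter that imitates a key sequence are all hypothetical. The primary index maps id to entity, and the secondary index maps name to the primary keys of the matching entities.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: two in-memory maps imitating a primary index and a
// secondary index. All names here are hypothetical, not DPL API.
public class IndexSketch {

    public static final class Person {
        public final long id;     // plays the role of the primary key
        public final String name; // plays the role of the secondary key
        public final int age;

        Person(long id, String name, int age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }
    }

    // Primary index: stores the entities themselves, keyed by id.
    private final Map<Long, Person> personById = new TreeMap<>();

    // Secondary index: stores only primary keys, keyed by name.
    // Many persons may share one name, so each name maps to a list of ids.
    private final Map<String, List<Long>> idsByName = new TreeMap<>();

    // Simple counter imitating an automatically assigned key sequence.
    private long nextId = 1;

    /** Stores a new person and updates both indexes. */
    public Person put(String name, int age) {
        Person person = new Person(nextId++, name, age);
        personById.put(person.id, person);
        idsByName.computeIfAbsent(name, k -> new ArrayList<>()).add(person.id);
        return person;
    }

    /** Looks up one entity by primary key, or null if absent. */
    public Person getById(long id) {
        return personById.get(id);
    }

    /** Looks up entities by secondary key by going through the primary index. */
    public List<Person> getByName(String name) {
        List<Person> result = new ArrayList<>();
        for (Long id : idsByName.getOrDefault(name, Collections.<Long>emptyList())) {
            result.add(personById.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        IndexSketch store = new IndexSketch();
        Person sally = store.put("Sally", 30);
        store.put("Harry", 41);
        System.out.println(store.getById(sally.id).name);
        System.out.println(store.getByName("Sally").size());
    }
}
```

    The point of the sketch is that a lookup by name is really two index accesses: one in the secondary index to find primary keys, and one in the primary index to fetch the entities. With real DPL indexes, queries are likewise written as ordinary Java against the index.
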
    For example, the code below queries Person entities that have a name starting with "H" (the true and false arguments specify whether each end of the key range is inclusive) and an age under 90, with the results ordered by name:

        EntityCursor<Person> people =
            personByName.entities("H", true, "I", false);
        try {
            for (Person person : people) {
                if (person.age < 90) {
                    // do something
                }
            }
        } finally {
            people.close();
        }

    This is one of the things that makes Berkeley DB a fish and not a salamander. You have to optimize the query yourself by using indexes appropriately. But you get the benefits of maximum performance, complete control, and a very simple API and environment to work with.

    If you think about executing a query using Java, as shown above, you may be concerned that un-marshaling the objects in order to filter on a property, 'age' in this example, will be slower than using a query language. In fact, the DPL was designed with this use case in mind. Marshaling and un-marshaling have been made extremely fast by using generated bytecode rather than Java reflection.

    In addition to using indexes to access data as shown above, you can optionally use the standard Java collections framework to access data. For each index, a standard java.util.Map object can be obtained.

    I'd like to show one more annotation feature supported by the DPL that is also supported by relational databases:

        @Entity
        class Person {
            ...
            @SecondaryKey(relate=MANY_TO_ONE, relatedEntity=Employer.class)
            private long employerId;
            ...
        }

    When you define a @SecondaryKey you can specify that it is a foreign key with the relatedEntity property. This causes foreign key constraints to be enforced, just like in a relational database. You can also specify an "on delete" action as you can with a relational database.

    So with Berkeley DB in some ways you are working at a lower level, particularly when performing complex queries.
    But on the other hand many of the high level features in a relational database are available, such as foreign key constraints.

    In addition there are several features of the DPL that are possible because there is no relational database in the picture and because Berkeley DB Java Edition is implemented natively using Java:

    + Arbitrary Java types may be used, such as enums, arrays, collections and embedded objects. These are not stored in a separate table or index; they are marshaled as part of the entity object's value.

    + For a given stored entity, references in the object graph at the time the object is stored are preserved when the entity is later retrieved. You may be familiar with this feature if you have used the built-in Java object serialization.

    + Entity classes may have subclasses and superclasses, as may embedded object classes, and polymorphism works as expected in Java. For example, a secondary key may be defined on an entity subclass and an index for that key will contain only instances of that subclass.

    + Class evolution is fully supported. Fields and classes may be added, removed, renamed, or converted by a custom conversion method. Evolution of instance data is performed lazily as it is read, or eagerly using an explicit conversion method.

    The last feature I'd like to point out is that the use of annotations is optional. If you have another source of metadata (primary and secondary key information) you can implement your own EntityModel to supply your metadata to the DPL. The metadata could come from anywhere you choose -- it could be loaded from XML files, derived by using naming conventions, etc. This takes POJO persistence one step further since without the use of DPL annotations, the DPL packages do not even need to be present to compile your persistent classes or load them at runtime.

    Hopefully I have made the differences between the DPL and the JPA clear and have made a case for why these differences exist.
    Please post any questions or comments you have.

    If you would like to try out the Berkeley DB Java Edition product, please go ahead and download it:

    http://www.oracle.com/database/berkeley-db/index.html

    If you have usage questions or comments, the product discussion forum is a good place to start:

    http://forums.oracle.com/forums/forum.jspa?forumID=273
  2. There was no real need for a new API. JDO is there for this purpose. Maybe it took too much effort to develop (mainly) a proper JDOQL mapping. Or maybe the new vendor has different interests. Guido.
  3. Why the differences? Because when you perform a query with Berkeley DB, you don't use a high level query language that moves the choice of using indexes or not into a query optimizer. With Berkeley DB, queries are performed by accessing Btree indexes directly.
    I think the answer to your guess is there in original post, Guido. BDB API is lower than any query language. If you need to do queries, you would be better off using full-fledged database.
  4. Why the differences? Because when you perform a query with Berkeley DB, you don't use a high level query language that moves the choice of using indexes or not into a query optimizer. With Berkeley DB, queries are performed by accessing Btree indexes directly.


    I think the answer to your guess is there in original post, Guido.

    BDB API is lower than any query language. If you need to do queries, you would be better off using full-fledged database.
    I don't think it is impossible to map JDOQL to a mix of index access and in-memory filtering/processing, provided that sufficient metadata are defined. Surely is not straightforward but..... And JDO is not only a query engine. Guido.
  5. I'm not saying it's impossible. I'm saying that the beauty and value of BDB API is that you are exposed to 'raw' indices and records with unsurpassed speed, and for some tasks it is just what you need. If you implement query language on top of BDB, you'd get something like Cloudscape/Derby. It's already there. :)
  6. It's a common theme now.

    While BDB is not distributed, I think you'll see many distributed state managers headed in the same direction. ObjectGrid already has index support and index-based query APIs for retrieving objects stored within it. The stack picture comparing BDB to a normal database is very similar to ObjectGrid. There's nothing to stop someone implementing a query engine on top of those indexes; you just pay the cost of query processing rather than taking on a more complex coding exercise to do it with indexes. Jofti does this already for caches. You use it, you pay the runtime cost of that simplicity. I think this is an exciting space right now with opportunities for differentiating. I know we're working hard on the next version of ObjectGrid in this kind of direction. You can download OG from IBM's site.
  7. I'm not saying it's impossible.

    I'm saying that the beauty and value of BDB API is that you are exposed to 'raw' indices and records with unsurpassed speed, and for some tasks it is just what you need.

    If you implement query language on top of BDB, you'd get something like Cloudscape/Derby. It's already there. :)
    Again JDO is the solution. You can get the raw "connection" from PersistenceManager in order to access low level features. But the advantage is that you can use a standard interface to add/remove objects from the store, to manage transaction etc. Well, let's say that I don't think BDB API are with no value, but providing a JDO wrapper would have been a super-extra value. Maybe in BDB 4 ? Guido
  8. The Direct Persistence Layer is super easy and straightforward to use and BDB integrates well with JTX. BDB is much easier than any of the persistence frameworks I have seen so far. I don't see any advantage in adding the complexity of JDO on top of it. The only two things I'd like to see in 4.0 are:

    - replication, which AFAIK the BDB team is working on (maybe even some support for partitioning)
    - interceptor hooks on the EntityStore (onSave, onLoad, etc.)

    Thanks for the great work.
  9. Re: Interceptor hooks

    Hi Christian,

    Thanks for the positive comments! Yes, you're correct that we consider replication to be a high priority feature.

    On interceptor hooks, can you give more details about what you're looking for? By onSave/onLoad are you looking for triggers whenever an entity is inserted, updated or deleted? I can think of two possible reasons that triggers would be useful:

    1) Convenience. Instead of having to notify a trigger method yourself whenever you make a change, the system would do that for you.

    2) Notification of cascading deletes and updates. When you use onRelatedEntityDelete=CASCADE or NULLIFY, the system is performing a delete or update behind the scenes when the related entity is deleted. Triggers would notify your application when this happens.

    Is convenience the reason you'd like this, or do you need to know about cascading deletes and updates, or both?

    Thanks,
    Mark
  10. Re: JDO implementation

    Hi Guido, Thanks for your comments. We don't currently have any plans for a JDO implementation. At one time there seemed to be some demand for this by those evaluating our product, but not recently. JDO is quite a big chunk of work, so we won't do it unless there is strong demand. As mentioned by another poster, we currently see very strong demand for replication. Mark
  11. use SQL to query Java objects

    This project may help if one wants to use SQL to search and retrieve Java objects in collections: http://josql.sourceforge.net/
  12. Re: JDO implementation

    Hi Guido,

    Thanks for your comments. We don't currently have any plans for a JDO implementation. At one time there seemed to be some demand for this by those evaluating our product, but not recently. JDO is quite a big chunk of work, so we won't do it unless there is strong demand. As mentioned by another poster, we currently see very strong demand for replication.

    Mark
    Yes, I remember some posts related to the form of the new API proposal, and JDO was suggested because of its storage agnosticism by design. I know that a full JDO2 implementation is a lot of work because of the many features added since JDO 1.x. Anyway, I think that several concepts in JDO can be applied to any storage engine, and giving it JDO clothing, not necessarily with strict TCK compliance, would be of value. But, OK, the first thing is a consistent working interface and strong operational functionality. Guido
  13. Amazon.com opensource!

    Why not use http://carbonado.sourceforge.net/? Amazon.com opensource!