Java Development News:

Binding XML to Java

By Ed Merks and Elena Litani

01 Aug 2006 | TheServerSide.com

Introduction

Manipulating XML data easily and efficiently in Java remains an important problem. Numerous approaches to XML binding exist in the industry, including DOM, JAXB, XML Beans, Castor, SDO and so on. In this article we will explore how the Eclipse Modeling Framework, EMF, solves the XML binding problem in a number of interesting ways, and we'll compare that to the alternatives.

The model that is used to represent models in EMF is called Ecore, and since Ecore is itself a model, it is called a meta model, i.e., the model of a model. EMF supports this core meta model API, Ecore, analogous to XML Schema, as well as a core instance data model API, EObject, analogous to DOM Node. Ecore is to abstract syntax what XML Schema is to concrete syntax, i.e., a unifying meta model. But rather than start with vague abstractions, it seems best to start from something well known and concrete on which to draw comparisons.

To bring concreteness to the discussion, we explore the binding problem by way of an example. Consider the problem of creating the following XML instance using W3C DOM, i.e., org.w3c.dom.*.

 <?xml version="1.0" encoding="UTF-8"?>
 <tree:rootNode xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree"
  label="root">
  <tree:childNode label="text">text</tree:childNode>
  <tree:childNode label="comment"><!--comment--></tree:childNode>
  <tree:childNode label="cdata"><![CDATA[<cdata>]]></tree:childNode>
 </tree:rootNode>

Download this zip file for the complete source code for this article.

We will first look at how to use DOM to create the sample instance, serialize it, load it, and traverse it. Then we will look at how to use EMF to do the analogous things, and reflect on the differences and similarities. After that, we will explore how XML Schema helps bring structure and order to the binding problem by helping set expectations about what meaningful content may appear. EMF provides the XSD model to represent XML Schema instances, and it provides support for mapping XSD instances to Ecore instances.

We'll demonstrate how this can be used for processing data instances in a more structured way.

Given that EMF supports generating a Java API from Ecore, the mapping from XSD to Ecore provides an implicit mapping onto Java. The Eclipse EMF website XML tutorial shows how to generate a Java implementation starting with a schema. We will look at how that generated API can be used to process an instance and at how EMF's flexible resource model supports multiple cross linked documents. For the sake of brevity, we will ignore exception handling.

Creating and serializing DOM

As in all systems, DOM requires steps to set up an environment. It begins with the creation of a document builder factory, which can be configured in interesting ways; for our purposes, it should be namespace aware. The document builder factory is used to create a document builder which in turn is used to create the desired document that is the starting point for building our DOM instance.

  DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
  documentBuilderFactory.setNamespaceAware(true);
  DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
  Document document = documentBuilder.newDocument();

Using the document, we create a root element with the appropriate namespace and prefix and add it as the root document element.

  final String NAMESPACE_URI = "http://www.eclipse.org/emf/example/dom/Tree";
  final String NAMESPACE_PREFIX = "tree";
  ...
  Element rootTreeNode = 
   document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":rootNode");
  document.appendChild(rootTreeNode);

By default, DOM does not add the namespace declarations needed to make the document namespace well-formed, so we need to add the following namespace declaration ourselves:

  rootTreeNode.setAttributeNS
    ("http://www.w3.org/2000/xmlns/", "xmlns:" + NAMESPACE_PREFIX, NAMESPACE_URI);

The label attribute of the root is set to the appropriate value as follows:

  rootTreeNode.setAttributeNS(null, "label", "root");

Next we create and add all the children. In order to get the formatting exactly as in the example, we'll add all the formatting ourselves, rather than rely on indented formatting while printing. We create the first child element, append indentation to the root, append the child to the root, set the label of the child, and append a regular text node to the child.

  Element textChildTreeNode = 
    document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":childNode");
  rootTreeNode.appendChild(document.createTextNode("\n "));
  rootTreeNode.appendChild(textChildTreeNode);
  textChildTreeNode.setAttributeNS(null, "label", "text");
  textChildTreeNode.appendChild(document.createTextNode("text"));

The subsequent children are created in the same way, but using a comment node or CDATA node.

  Element commentChildTreeNode = 
    document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":childNode");
  rootTreeNode.appendChild(document.createTextNode("\n "));
  rootTreeNode.appendChild(commentChildTreeNode);
  commentChildTreeNode.setAttributeNS(null, "label", "comment");
  commentChildTreeNode.appendChild(document.createComment("comment"));
  
  Element cdataChildTreeNode = 
    document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":childNode");
  rootTreeNode.appendChild(document.createTextNode("\n "));
  rootTreeNode.appendChild(cdataChildTreeNode);
  cdataChildTreeNode.setAttributeNS(null, "label", "cdata");
  cdataChildTreeNode.appendChild(document.createCDATASection("<cdata>"));

Finally, to ensure that the closing XML delimiter of the root element is on a new line, we append a line break.

  rootTreeNode.appendChild(document.createTextNode("\n"));

To serialize the document we have just created, we must create a transformer factory and use that to create the transformer which does the job:

  TransformerFactory transformerFactory = TransformerFactory.newInstance();
  Transformer transformer = transformerFactory.newTransformer();
  transformer.transform(new DOMSource(document), new StreamResult(System.out));
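
The steps above can be collected into one self-contained class. The following is a minimal sketch, not the article's downloadable source: the class name DomTreeWriter and the helper addChild are hypothetical, the result is written to a string rather than to System.out, and checked exceptions are simply wrapped.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class assembling the DOM steps shown in this section.
final class DomTreeWriter {
  static final String NAMESPACE_URI = "http://www.eclipse.org/emf/example/dom/Tree";
  static final String NAMESPACE_PREFIX = "tree";

  /** Builds the sample tree and returns its serialized form. */
  static String buildAndSerialize() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document document = factory.newDocumentBuilder().newDocument();

      // Root element, its namespace declaration, and its label.
      Element root =
          document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":rootNode");
      document.appendChild(root);
      root.setAttributeNS(
          "http://www.w3.org/2000/xmlns/", "xmlns:" + NAMESPACE_PREFIX, NAMESPACE_URI);
      root.setAttributeNS(null, "label", "root");

      // Three children that differ only in the kind of node they contain.
      addChild(document, root, "text").appendChild(document.createTextNode("text"));
      addChild(document, root, "comment").appendChild(document.createComment("comment"));
      addChild(document, root, "cdata")
          .appendChild(document.createCDATASection("<cdata>"));

      // Put the closing tag of the root on its own line.
      root.appendChild(document.createTextNode("\n"));

      StringWriter writer = new StringWriter();
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.transform(new DOMSource(document), new StreamResult(writer));
      return writer.toString();
    } catch (Exception e) {
      // The article ignores exception handling; this sketch just wraps failures.
      throw new IllegalStateException(e);
    }
  }

  /** Appends indentation and a labeled childNode element to the parent. */
  private static Element addChild(Document document, Element parent, String label) {
    parent.appendChild(document.createTextNode("\n "));
    Element child =
        document.createElementNS(NAMESPACE_URI, NAMESPACE_PREFIX + ":childNode");
    child.setAttributeNS(null, "label", label);
    parent.appendChild(child);
    return child;
  }
}
```

Calling DomTreeWriter.buildAndSerialize() returns the serialized document as a string. The exact XML declaration and the order of the root's attributes depend on the transformer implementation in use.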

Loading and traversing DOM

Loading an instance requires much the same setup but rather than using the document builder to create a blank new document we use the builder to parse a document from a file.

  String DATA_FOLDER = "c:/data/";
  ...
  DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
  documentBuilderFactory.setNamespaceAware(true);
  DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
  Document document = 
    documentBuilder.parse(new File(DATA_FOLDER + "DOMTreeNode.xml"));

A recursive method can be used to traverse the element structure of the document. Starting with the document root element, we traverse each element, printing the namespace, name, and label attributes, and visiting each type of child node:

  new Object()
  {
   public void traverse(String indent, Element element)
   {
    System.out.println
      (indent + "{" + element.getNamespaceURI() + "}" + element.getLocalName());
    System.out.println(indent + " label=" + element.getAttributeNS(null, "label"));
    for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling())
    {
     // Consider each type of child.
     //
     switch (child.getNodeType())
     {
      case Node.TEXT_NODE:
      {
       System.out.println
         (indent + " '" + child.getNodeValue().replaceAll("\n", "\\\\n") + "'");
       break;
      }
      case Node.COMMENT_NODE:
      {
       System.out.println(indent + " <!--" + child.getNodeValue() + "-->");
       break;
      }
      case Node.CDATA_SECTION_NODE:
      {
       System.out.println(indent + " <![CDATA[" 
    + child.getNodeValue() + "]]>");
       break;
      }
      case Node.ELEMENT_NODE:
      {
       traverse(indent + " ", (Element)child);
       break;
      }
     }
    }
   }
  }.traverse("", document.getDocumentElement());

It prints out the following results for the sample instance:

  {http://www.eclipse.org/emf/example/dom/Tree}rootNode
   label=root
   '\n '
   {http://www.eclipse.org/emf/example/dom/Tree}childNode
    label=text
    'text'
   '\n '
   {http://www.eclipse.org/emf/example/dom/Tree}childNode
    label=comment
    <!--comment-->
   '\n '
   {http://www.eclipse.org/emf/example/dom/Tree}childNode
    label=cdata
    <![CDATA[<cdata>]]>
   '\n'

DOM is very simple, but also very low level.

Creating and serializing EMF AnyType

Having looked at how to use DOM to create the sample instance, serialize it, load it, and traverse it, we'll now look at how to use EMF to do the analogous things. First let's briefly consider some of the EMF APIs that will be needed. EMF has the concept of Resource, i.e., a container for your data. A resource usually represents a file, or a URL link, or any other physical or logical holder for your data. Related resources are typically stored in the same ResourceSet which provides a context for resolving references among the resources, e.g., if there is a cross reference between two instance documents, such documents will be loaded into a common resource set.

As we mentioned earlier, EMF maps XML Schema to Ecore. Since some of the XML Schema-specific constructs (such as wildcards) are not represented directly in Ecore, such information is added to Ecore using Ecore annotations. These annotations are referred to as extended meta data. EMF provides an API (ExtendedMetaData) to query, modify or create this additional meta data.

Now we are ready to set up our environment. First, we create a resource set which will hold and provide access to our data and meta data. We also create a local instance of ExtendedMetaData, and register it so that this one instance will be used when loading data into the resource set. Note that this local instance of extended meta data will cache and provide access to all the meta data created during the lifetime of this object:

  ResourceSet resourceSet = new ResourceSetImpl();
  final ExtendedMetaData extendedMetaData =
     new BasicExtendedMetaData(resourceSet.getPackageRegistry());
  resourceSet.getLoadOptions().put
    (XMLResource.OPTION_EXTENDED_META_DATA, extendedMetaData);

Since EMF is strictly modeled, all data needs to have a schema that defines that data. Given that in the current example data can contain anything (element, CDATA section, comment), we can demand-create a simple Ecore model that would correspond to the following XML Schema:

  <xsd:schema xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
     targetNamespace="http://www.eclipse.org/emf/example/dom/Tree">
   <xsd:element name="rootNode" type="xsd:anyType"/>
  </xsd:schema>

First let's very briefly consider the basic structure of an Ecore model. At the root there is a package which has an nsURI corresponding to a schema's namespace, and contains classifiers corresponding to the types in a schema. Each classifier is either a class or a data type, corresponding to either a complex type or a simple type in the schema. A class has a name and contains features. Instances of a class are of type EObject. Each feature of a class, corresponding to an element or attribute in the schema, is either a reference or an attribute, depending on whether the type of the element or attribute is complex or simple. I.e., references have class types and attributes have data types. Data type instances are serialized by conversion to and from a simple string representation, unlike class instances, i.e., EObject instances, which need a hierarchical element structure to serialize all their complex parts. All the data of an EObject instance are accessible reflectively, e.g., eGet(feature) fetches a value, eSet(feature, value) updates a value, and eClass() provides access to the set of available features.
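
To make the vocabulary above concrete, here is a deliberately tiny, plain-Java sketch of a reflective model. It does not use EMF: MetaClass, DynamicObject, get and set are hypothetical stand-ins for the roles played by EClass, EObject, eGet and eSet, and a feature is simplified to a name plus a Java type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; these types are hypothetical, not EMF classes.
final class ReflectiveModelSketch {

  /** Plays the role of a class in the meta model: a name plus typed features. */
  static final class MetaClass {
    final String name;
    final Map<String, Class<?>> features = new LinkedHashMap<>();

    MetaClass(String name) {
      this.name = name;
    }

    /** Declares a feature and returns this meta class for chaining. */
    MetaClass feature(String featureName, Class<?> type) {
      features.put(featureName, type);
      return this;
    }

    DynamicObject newInstance() {
      return new DynamicObject(this);
    }
  }

  /** Plays the role of an instance: all data is reached through the meta class. */
  static final class DynamicObject {
    private final MetaClass metaClass;
    private final Map<String, Object> values = new LinkedHashMap<>();

    DynamicObject(MetaClass metaClass) {
      this.metaClass = metaClass;
    }

    /** Gives access to the set of available features. */
    MetaClass metaClass() {
      return metaClass;
    }

    /** Fetches the value of a declared feature (null if it was never set). */
    Object get(String featureName) {
      typeOf(featureName);
      return values.get(featureName);
    }

    /** Updates the value of a declared feature after checking its type. */
    void set(String featureName, Object value) {
      Class<?> type = typeOf(featureName);
      if (value != null && !type.isInstance(value)) {
        throw new IllegalArgumentException(
            featureName + " expects " + type.getName());
      }
      values.put(featureName, value);
    }

    /** Looks up the declared type, rejecting features the meta class lacks. */
    private Class<?> typeOf(String featureName) {
      Class<?> type = metaClass.features.get(featureName);
      if (type == null) {
        throw new IllegalArgumentException(
            "No feature " + featureName + " in " + metaClass.name);
      }
      return type;
    }
  }
}
```

The point of the sketch is only that an instance carries no hard-coded accessors: what can be read or written is decided entirely by the meta data it was created from.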

Now let's create a feature to represent the global root element declaration. We do this by using extended meta data to demand the following feature:

  EStructuralFeature rootNodeFeature = 
    extendedMetaData.demandFeature(NAMESPACE_URI, "rootNode", true);

A feature is always defined within a containing class, and for a feature corresponding to a global element (or attribute) declaration, this will be a special class referred to as a document root. We need access to that class in order to create an instance to hold our root element.

  EClass documentRootClass = rootNodeFeature.getEContainingClass();

As with any class, we can create an instance from it. This instance is directly analogous to a DOM document.

  EObject documentRoot = EcoreUtil.create(documentRootClass);

A document root has a special feature to hold the xmlns prefixes, so we use that to fetch the map and to set the prefix we want for our sample instance.

  EMap xmlnsPrefixMap = 
    (EMap)documentRoot.eGet
      (extendedMetaData.getXMLNSPrefixMapFeature(documentRootClass));
  xmlnsPrefixMap.put(NAMESPACE_PREFIX, NAMESPACE_URI);

Next we use the factory corresponding to the XML Schema namespace to create an AnyType instance, which is the mapping for XML Schema's anyType and provides effectively the same data API as DOM, and we add it to the document root. The AnyType instance is much like a DOM element, but, as we'll see later, it is more properly thought of as an instance of a complex type.

  AnyType rootTreeNode = XMLTypeFactory.eINSTANCE.createAnyType();
  documentRoot.eSet(rootNodeFeature, rootTreeNode);

To set the label feature, we demand a feature that corresponds to a global attribute declaration with the appropriate namespace and name, and use that to set the label feature to the appropriate value.

  EStructuralFeature labelAttribute = extendedMetaData.demandFeature(null, "label", false);
  rootTreeNode.eSet(labelAttribute, "root");

Since the AnyType has mixed content, it has an EMF feature map representing that mixed content. A feature map is effectively a list of feature value pairs designed to help support things like mixed content and substitution groups. We'll need access to that feature map to add any content.

  FeatureMap rootMixed = rootTreeNode.getMixed();
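
As a rough mental model only, a feature map can be pictured as an ordered list of (feature, value) entries. The sketch below is plain Java and does not use EMF's FeatureMap API; the class MixedContentSketch, its methods, and the feature names "text" and "childNode" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feature map; not EMF's FeatureMap API.
final class MixedContentSketch {

  /** One entry: the feature a value belongs to, plus the value itself. */
  static final class Entry {
    final String feature;
    final Object value;

    Entry(String feature, Object value) {
      this.feature = feature;
      this.value = value;
    }
  }

  // Entries are kept in insertion order, which is the document order.
  private final List<Entry> entries = new ArrayList<>();

  /** Appends a value for the given feature and returns this for chaining. */
  MixedContentSketch add(String feature, Object value) {
    entries.add(new Entry(feature, value));
    return this;
  }

  int size() {
    return entries.size();
  }

  String featureAt(int index) {
    return entries.get(index).feature;
  }

  Object valueAt(int index) {
    return entries.get(index).value;
  }

  /** Returns the values of one feature only, still in document order. */
  List<Object> valuesOf(String feature) {
    List<Object> result = new ArrayList<>();
    for (Entry entry : entries) {
      if (entry.feature.equals(feature)) {
        result.add(entry.value);
      }
    }
    return result;
  }
}
```

Because text and child elements share one ordered list, interleaved content such as whitespace between children keeps its position, while a per-feature view such as valuesOf("childNode") can still be derived from the same data.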

To create a child tree node with the correct element name, we demand yet another element-based feature, and then we create the first child, append indentation to the root, append the child to the root, set the label of the child, and finally append text to the child.

  EStructuralFeature childNodeFeature = 
   extendedMetaData.demandFeature(NAMESPACE_URI, "childNode", true);
  AnyType textChildTreeNode = XMLTypeFactory.eINSTANCE.createAnyType();
  FeatureMapUtil.addText(rootMixed, "\n ");
  rootMixed.add(childNodeFeature, textChildTreeNode);
  textChildTreeNode.eSet(labelAttribute, "text");
  FeatureMapUtil.addText(textChildTreeNode.getMixed(), "text");

The subsequent children are created in the same way, but using a comment feature or CDATA feature:

  AnyType commentChildTreeNode = XMLTypeFactory.eINSTANCE.createAnyType();
  FeatureMapUtil.addText(rootMixed, "\n ");
  rootMixed.add(childNodeFeature, commentChildTreeNode);
  commentChildTreeNode.eSet(labelAttribute, "comment");
  FeatureMapUtil.addComment(commentChildTreeNode.getMixed(), "comment");
  
  AnyType cdataChildTreeNode = XMLTypeFactory.eINSTANCE.createAnyType();
  FeatureMapUtil.addText(rootMixed, "\n ");
  rootMixed.add(childNodeFeature, cdataChildTreeNode);
  cdataChildTreeNode.eSet(labelAttribute, "cdata");
  FeatureMapUtil.addCDATA(cdataChildTreeNode.getMixed(), "<cdata>");

As with DOM, to ensure that the closing delimiter will begin on a new line, we add a line break.

  FeatureMapUtil.addText(rootMixed, "\n");

To serialize the instance, we register an appropriate default resource factory. Here we choose to use GenericXMLResourceFactoryImpl which already sets the options needed to process any XML document. After registering the resource factory, we need to create a resource to hold our instance, add the instance to the resource, and save it.

  resourceSet.getResourceFactoryRegistry().getExtensionToFactoryMap().put
    (Resource.Factory.Registry.DEFAULT_EXTENSION, 
     new GenericXMLResourceFactoryImpl());
    
  Resource resource = 
    resourceSet.createResource(URI.createFileURI(DATA_FOLDER + "EMFDOMTreeNode.xml"));
  resource.getContents().add(documentRoot);
  resource.save(System.out, null);

EMF supports converting an XML resource directly to a DOM document (and can record a map between the two representations). The following could be used to display the result of such a synthesized DOM:

  Document document = ((XMLResource)resource).save(null, null, null);
  TransformerFactory transformerFactory = TransformerFactory.newInstance();
  Transformer transformer = transformerFactory.newTransformer();
  transformer.transform(new DOMSource(document), new StreamResult(System.out));
  System.out.println();

Loading and Traversing EMF AnyType

Loading an instance with EMF requires exactly the same setup as for saving.

  ResourceSet resourceSet = new ResourceSetImpl();
  final ExtendedMetaData extendedMetaData = 
    new BasicExtendedMetaData(resourceSet.getPackageRegistry());
  resourceSet.getLoadOptions().put
    (XMLResource.OPTION_EXTENDED_META_DATA, extendedMetaData);
  resourceSet.getResourceFactoryRegistry().getExtensionToFactoryMap().put
    (Resource.Factory.Registry.DEFAULT_EXTENSION, 
     new GenericXMLResourceFactoryImpl());

Given such a resource set, we can simply demand load the resource, fetch the document root, and extract the root element, which we'll assume to be an instance of AnyType.

  Resource resource = 
    resourceSet.getResource(URI.createFileURI(DATA_FOLDER + "EMFDOMTreeNode.xml"), true);
  EObject documentRoot = (EObject)resource.getContents().get(0);
  AnyType rootTreeNode = (AnyType)documentRoot.eContents().get(0);
  

To access the label feature during the instance traversal, we'll cache it. Since we used the same extended meta data to demand the features, we can now use the same API to look up the demanded features we've created:

  final EStructuralFeature labelAttribute = extendedMetaData.demandFeature(null, "label", false);
  

As with DOM, a recursive method can be used to traverse the structure of the document. Starting with the root node, we traverse each node, printing the namespace, name, and label attributes, and visiting each type of child node.

  new Object()
  {
   public void traverse(String indent, AnyType anyType)
   {
    System.out.println
      (indent + "{" + extendedMetaData.getNamespace(anyType.eContainmentFeature()) + "}" + 
                   extendedMetaData.getName(anyType.eContainmentFeature()));
    System.out.println(indent + " label=" + anyType.eGet(labelAttribute));
    FeatureMap featureMap = anyType.getMixed();
    for (int i = 0, size = featureMap.size(); i < size; ++i)
    {
     EStructuralFeature feature = featureMap.getEStructuralFeature(i);
     if (FeatureMapUtil.isText(feature))
     {
      System.out.println
        (indent + " '" + featureMap.getValue(i).toString().replaceAll("\n", "\\\\n") + "'");
     }
     else if (FeatureMapUtil.isComment(feature))
     {
      System.out.println(indent + " <!--" + featureMap.getValue(i) + "-->");
     }
     else if (FeatureMapUtil.isCDATA(feature))
     {
      System.out.println(indent + " <![CDATA[" + featureMap.getValue(i) + "]]>");
     }
     else if (feature instanceof EReference)
     {
      traverse(indent + " ", (AnyType)featureMap.getValue(i));
     }
    }
   }
  }.traverse("", rootTreeNode);

Raising the level of abstraction

Reflecting on both approaches, it's quite clear that they are roughly equivalent, and neither approach is very satisfying. Moreover, the Ecore meta data seems to provide little additional value to justify the complexity it introduces.

Going back to the original problem of XML binding, a basic problem is that the XML infoset is a very complex low-level abstraction and that complexity is reflected in any fully general data model representation of it. That's why reinventing DOM will not lead to a great new leap forward.

To address this fundamental problem, XML Schema helps bring order to the unneeded generality of the XML infoset by specifying a grammatical type structure to impose on the infoset. In a quest to exploit this order, the problem of XML binding to Java has focused primarily on mapping XML Schema to Java as a way to produce a cleaner, more type-specific, API. Let's look into this more closely by considering possible schemas for our example. The following trivial schema is definitely a schema for our example, but an effectively useless one:

  <xsd:schema xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
     targetNamespace="http://www.eclipse.org/emf/example/dom/Tree">
   <xsd:element name="rootNode" type="xsd:anyType"/>
  </xsd:schema>

This schema is interesting to consider, because it is in effect precisely the schema that's being used in the preceding DOM example. The anyType is defined in XML Schema's schema for schemas as the following complex type:

 <xsd:complexType name="anyType" mixed="true">
  <xsd:sequence>
   <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="lax"/>
  </xsd:sequence>
  <xsd:anyAttribute processContents="lax"/>
 </xsd:complexType>

I.e., it defines a type for elements that have mixed content, any elements, and any attributes. The Ecore mapping for XML Schema's anyType is as follows, with each access method there for the obvious reason.

  public interface AnyType extends EObject
  {
   FeatureMap getMixed();
   FeatureMap getAny();
   FeatureMap getAnyAttribute();
  }

Again, it's a fully general representation of mixed text, any element features, and any attribute features. It says nothing about what specific things to expect in the content and it gives us no information with which to provide a simpler API.

The following schema is a more interesting and restrictive version that describes the specific things we really do expect to see in our example.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsd:schema xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
     targetNamespace="http://www.eclipse.org/emf/example/dom/Tree">
   <xsd:complexType mixed="true" name="TreeNode">
    <xsd:sequence>
     <xsd:element form="qualified" maxOccurs="unbounded" 
        name="childNode" type="tree:TreeNode"/>
    </xsd:sequence>
    <xsd:attribute name="label" type="xsd:ID"/>
   </xsd:complexType>
   <xsd:element name="rootNode" type="tree:TreeNode"/>
  </xsd:schema>

It defines a TreeNode complex type with an element for "childNode" that is recursively of type TreeNode, and an attribute for "label" of type ID, so the value of this must be unique in the document. It also defines a "rootNode" element of TreeNode type.
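
One way to see the effect of the ID type is to validate instances against the schema with the JDK's javax.xml.validation API. This is a sketch under assumptions: the class TreeSchemaCheck and its helpers are hypothetical, and the inlined schema adds minOccurs="0" to childNode (not present in the schema above) so that leaf nodes without children are accepted by a strict validator.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper; validates small tree documents against an inlined schema.
final class TreeSchemaCheck {
  static final String NS = "http://www.eclipse.org/emf/example/dom/Tree";

  // Same shape as the schema above, except for the assumed minOccurs='0'.
  static final String SCHEMA =
      "<xsd:schema xmlns:tree='" + NS + "'"
          + " xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
          + " targetNamespace='" + NS + "'>"
          + "<xsd:complexType mixed='true' name='TreeNode'>"
          + "<xsd:sequence>"
          + "<xsd:element form='qualified' minOccurs='0' maxOccurs='unbounded'"
          + " name='childNode' type='tree:TreeNode'/>"
          + "</xsd:sequence>"
          + "<xsd:attribute name='label' type='xsd:ID'/>"
          + "</xsd:complexType>"
          + "<xsd:element name='rootNode' type='tree:TreeNode'/>"
          + "</xsd:schema>";

  /** Returns true if the document is well-formed and valid for the schema. */
  static boolean isValid(String xml) {
    Schema schema;
    try {
      schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
          .newSchema(new StreamSource(new StringReader(SCHEMA)));
    } catch (SAXException e) {
      throw new IllegalStateException("The inlined schema should compile", e);
    }
    try {
      schema.newValidator().validate(new StreamSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      // Covers schema violations (such as a repeated ID) and malformed input.
      return false;
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Builds a root labeled "root" with two children carrying the given labels. */
  static String tree(String firstLabel, String secondLabel) {
    return "<tree:rootNode xmlns:tree='" + NS + "' label='root'>"
        + "<tree:childNode label='" + firstLabel + "'>text</tree:childNode>"
        + "<tree:childNode label='" + secondLabel + "'><!--comment--></tree:childNode>"
        + "</tree:rootNode>";
  }
}
```

With this setup, a document whose labels are all distinct should validate, while one that repeats a label anywhere in the document should be rejected for violating ID uniqueness.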

The whole point of a schema is to describe the form of data to be expected and it is this information that facilitates the preparation of more specific data structures and APIs to carry and access the data. We will see how EMF takes advantage of this meta data to improve on the DOM-level complexity of the AnyType API.

In order to help highlight some of the interesting high-level capabilities of EMF, we will augment this schema to add EMF-specific annotations and an additional attribute to TreeNode. Specifically, we add an "ecore:name" annotation, to specify the name of the corresponding feature, an attribute called "references", whose value is a list of anyURI, and an "ecore:reference" annotation for that attribute, to specify that these URIs represent references to other TreeNodes. The idea of the URI-based reference is for it to appear just as an "href" does in HTML, i.e., in the form <location-of-document>#<xml-id-in-document>. Here's the slightly more interesting schema we will use for the example.

  <?xml version="1.0" encoding="UTF-8"?>
  <xsd:schema xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore"
     xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
     targetNamespace="http://www.eclipse.org/emf/example/dom/Tree">
   <xsd:complexType mixed="true" name="TreeNode">
    <xsd:sequence>
     <xsd:element ecore:name="childNodes" form="qualified" maxOccurs="unbounded"
        name="childNode" type="tree:TreeNode"/>
    </xsd:sequence>
    <xsd:attribute name="label" type="xsd:ID"/>
    <xsd:attribute ecore:reference="tree:TreeNode" name="references">
     <xsd:simpleType>
      <xsd:list itemType="xsd:anyURI"/>
     </xsd:simpleType>
    </xsd:attribute>
   </xsd:complexType>
   <xsd:element name="rootNode" type="tree:TreeNode"/>
  </xsd:schema>

An XML Schema is just an instance of a model, and EMF provides the XSD model to represent such instances. Creating an XSD instance is accomplished much like how we created the example AnyType instance, only with a specific API for XML Schema instances, e.g.,

  XSDSchema xsdSchema = XSDFactory.eINSTANCE.createXSDSchema();
  xsdSchema.setTargetNamespace(NAMESPACE_URI);

EMF exploits the information in XML Schemas by providing an Ecore builder that converts XSD instances to Ecore instances. To do that, we set up exactly the same environment as before, and create an instance of the XSDEcoreBuilder. We pass it an absolute URI representing the location of the sample schema. It creates a corresponding Ecore instance given our XSD instance. In particular, each namespace in the schema results in a corresponding EPackage, which acts as a container for all the types defined in that namespace. There's only one namespace in our example, so we extract that one corresponding package.

  XSDEcoreBuilder xsdEcoreBuilder = new XSDEcoreBuilder(extendedMetaData);
  URI schemaLocationURI = 
    URI.createFileURI(new File(MODEL_FOLDER + "DOMEMFTreeNode.xsd").getAbsolutePath());
  xsdEcoreBuilder.generate(schemaLocationURI);
  EPackage ePackage = extendedMetaData.getPackage(NAMESPACE_URI);
  

The following traverses the Ecore structure produced for our sample schema:

  System.out.println("package " + ePackage.getNsURI());
  for (Iterator i = ePackage.getEClassifiers().iterator(); i.hasNext(); )
  {
   // Each classifier is either a class or a data type.
   //
   EClassifier eClassifier = (EClassifier)i.next();
   if (eClassifier instanceof EClass)
   {
    // Print the class and then each of its features.
    //
    EClass eClass = (EClass)eClassifier;
    System.out.println(" class " + eClass.getName());
    for (Iterator j = eClass.getEStructuralFeatures().iterator(); j.hasNext(); )
    {
     // Each feature is either a reference or an attribute.
     //
     EStructuralFeature eStructuralFeature = (EStructuralFeature)j.next();
     if (eStructuralFeature instanceof EReference)
     {
      EReference eReference = (EReference)eStructuralFeature;
      System.out.println
       ("  reference " + eReference.getName() + ":" + 
          eReference.getEReferenceType().getName());
     }
     else
     {
      EAttribute eAttribute = (EAttribute)eStructuralFeature;
      System.out.println
       ("  attribute " + eAttribute.getName() + ":" + 
          eAttribute.getEAttributeType().getName());
     }
    }
   }
   else
   {
    EDataType eDataType = (EDataType)eClassifier;
    System.out.println(" data type " + eDataType.getName());
   }
  }
  

It produces the following output:

  package http://www.eclipse.org/emf/example/dom/Tree
   class DocumentRoot
    attribute mixed:EFeatureMapEntry
    reference xMLNSPrefixMap:EStringToStringMapEntry
    reference xSISchemaLocation:EStringToStringMapEntry
    reference rootNode:TreeNode
   class TreeNode
    attribute mixed:EFeatureMapEntry
    reference childNodes:TreeNode
    attribute label:ID
    reference references:TreeNode

Using this Ecore model, we load the example instance as follows:

  Resource resource = 
    resourceSet.getResource(URI.createFileURI(DATA_FOLDER + "EMFDOMTreeNode.xml"), true);
  

From that resource, we can get the document root instance and from that we can get the root object. This time we won't assume the root is an instance of AnyType, which in fact it won't be.

  EObject documentRoot = (EObject)resource.getContents().get(0);
  final EObject rootTreeNode = (EObject)documentRoot.eContents().get(0);
  

Instead we'll assume the root object and all the intermediate objects are of some special type representing a TreeNode. We can access that type directly from the instance, or we can look it up by name.

  if (rootTreeNode.eClass() != extendedMetaData.getType(NAMESPACE_URI, "TreeNode"))
  {
   throw new Exception("Bad meta data");
  }
  

Since we know what kind of data to expect for a TreeNode, we know to cache the three features needed to access the instance data during a traversal. Recall that we introduced a references attribute of type list of anyURI so that each tree node can not only contain other tree nodes but can also reference other tree nodes.

  final EStructuralFeature mixedFeature = 
    extendedMetaData.getMixedFeature(rootTreeNode.eClass());
  final EStructuralFeature labelAttribute = 
    extendedMetaData.getAttribute(rootTreeNode.eClass(), null, "label");
  final EStructuralFeature referencesAttribute = 
    extendedMetaData.getAttribute(rootTreeNode.eClass(), null, "references");

Since the mixed feature provides access to the "childNode" content in addition to the mixed text content, we won't need to use the child node feature directly in this example.

We can visit the data much as before, but without assuming they are AnyType instances, i.e., we use pure reflection via EMF's EObject API. Notice how the traversal uses the references feature to fetch the list of references for each tree node and to add the root tree node as one of the nodes being referenced.

  new Object()
  {
   public void traverse(String indent, EObject eObject)
   {
    System.out.println
      (indent + "{" + extendedMetaData.getNamespace(eObject.eContainmentFeature()) + "}" + 
                   extendedMetaData.getName(eObject.eContainmentFeature()));
    System.out.println(indent + " label=" + eObject.eGet(labelAttribute));
    
    ((List)eObject.eGet(referencesAttribute)).add(rootTreeNode);

    // Access the feature map reflectively.
    //
    FeatureMap featureMap = (FeatureMap)eObject.eGet(mixedFeature);
    for (int i = 0, size = featureMap.size(); i < size; ++i)
    {
     EStructuralFeature feature = featureMap.getEStructuralFeature(i);
     if (FeatureMapUtil.isText(feature))
     {
      System.out.println
        (indent + " '" + featureMap.getValue(i).toString().replaceAll("\n", "\\\\n") + "'");
     }
     else if (FeatureMapUtil.isComment(feature))
     {
      System.out.println(indent + " <!--" + featureMap.getValue(i) + "-->");
     }
     else if (FeatureMapUtil.isCDATA(feature))
     {
      System.out.println(indent + " <![CDATA[" + featureMap.getValue(i) + "]]>");
     }
     else if (feature instanceof EReference)
     {
      traverse(indent + " ", (EObject)featureMap.getValue(i));
     }
    }
   }
  }.traverse("", rootTreeNode);

While the traversal prints the same trace as before, the instance has modified references, so serializing it will show the references we've set:

  <?xml version="1.0" encoding="UTF-8"?>
  <tree:rootNode xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree" label="root"
    references="#root">
   <tree:childNode label="text" references="#root">text</tree:childNode>
   <tree:childNode label="comment" references="#root"><!--comment--></tree:childNode>
   <tree:childNode label="cdata" references="#root"><![CDATA[<cdata>]]></tree:childNode>
  </tree:rootNode>

When serializing each referenced object, EMF determines the resource containing the object, EObject.eResource(), and asks that resource how best to reference the object, Resource.getURIFragment(EObject). Because the label feature of the tree node is an ID, that value can be used to uniquely locate the instance in the document, and so that's what's returned. The reference is formed by appending the fragment to the resource's URI, resource.getURI(). Before using this URI, EMF determines if the URI can be made relative to the URI of the resource containing the reference. In this example, that means all the same-document references to the root object result simply in "#root".

During load, this process is reversed. That is, relative URIs are resolved to make them absolute, the portion of the URI without the fragment is used to demand load the resource, and the fragment portion is used to locate the object in the resource, Resource.getEObject. This URI behavior should come as no surprise, since that's how "href"s work in HTML documents.
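
The save and load steps described above can be sketched with plain JDK classes. This is only an illustration under stated assumptions: it uses java.net.URI as a stand-in for EMF's own URI class, the class and method names (UriReferenceSketch, toReference, resolve) are hypothetical, and the file name Copy.xml is made up for the example.

```java
import java.net.URI;

// Sketch only: java.net.URI stands in for EMF's URI class; all names here are hypothetical.
class UriReferenceSketch {

    // Save side: express the target object's URI relative to the referencing document.
    static String toReference(URI referencingDocument, URI target) {
        String full = target.toString();
        int hash = full.indexOf('#');
        String targetDocument = hash < 0 ? full : full.substring(0, hash);
        String fragment = hash < 0 ? "" : full.substring(hash); // includes the '#'
        if (targetDocument.equals(referencingDocument.toString())) {
            // Same document: the fragment alone is enough.
            return fragment;
        }
        // Different document: make the URI relative to the referencing document's folder.
        URI folder = referencingDocument.resolve(".");
        return folder.relativize(target).toString();
    }

    // Load side: turn a (possibly relative) reference back into an absolute URI.
    static URI resolve(URI referencingDocument, String reference) {
        return referencingDocument.resolve(reference);
    }

    public static void main(String[] args) {
        URI copy = URI.create("file:/data/Copy.xml"); // hypothetical file name
        System.out.println(toReference(copy, URI.create("file:/data/Copy.xml#root")));
        System.out.println(toReference(copy, URI.create("file:/data/TreeNodeWithReferences.xml#text")));
        URI resolved = resolve(copy, "TreeNodeWithReferences.xml#text");
        // The part before '#' identifies the document to load; the fragment locates the object.
        System.out.println(resolved + " has fragment " + resolved.getFragment());
    }
}
```

The same-document check comes first because relativizing against the folder alone would keep the file name in the reference.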

Generated APIs

To complete EMF's XML binding picture, we take advantage of EMF's Java code generation capabilities. We've already seen that, given the schema for this example, we can use EMF's tools to import it as Ecore. From that Ecore model, EMF can generate a Java API and implementation, and in fact a complete application including an editor. An online tutorial shows how, and Ant tasks are available to automate the process. For this sample, EMF generates a TreeNode interface with the expected methods:

  public interface TreeNode extends EObject
  {
   FeatureMap getMixed();
   EList getChildNodes();
   String getLabel();
   void setLabel(String value);
   EList getReferences();
  }

and a DocumentRoot interface with some general purpose methods, as well as methods for accessing the global elements, i.e., just the root node in this example.

  public interface DocumentRoot extends EObject
  {
   FeatureMap getMixed();
   EMap getXMLNSPrefixMap();
   EMap getXSISchemaLocation();
   TreeNode getRootNode();
   void setRootNode(TreeNode value);
  }

To use this generated model, we set up the environment as before, but register the generated resource factory and the generated package.

  resourceSet.getResourceFactoryRegistry()
  .getExtensionToFactoryMap().put
   (Resource.Factory.Registry.DEFAULT_EXTENSION, 
    new TreeResourceFactoryImpl());
  resourceSet.getPackageRegistry().put
  (TreePackage.eNS_URI, TreePackage.eINSTANCE);
  

We can use this environment to load the resource for the earlier example instance with the interesting root references. Note how an instance saved using a dynamic Ecore model is now loaded using a statically generated model.

  Resource resource = 
    resourceSet.getResource
      (URI.createFileURI(DATA_FOLDER + 
   "EMFDOMTreeNodeWithReferences.xml"), true);
  

This time when we fetch the document root and the root instance we can assume they are statically typed as expected according to the generated API.

  DocumentRoot documentRoot = (DocumentRoot)resource.getContents().get(0);
  TreeNode rootTreeNode = documentRoot.getRootNode();
  

EMF's reflective data API supports a rich collection of useful services, e.g., a copier, an equality helper, various cross referencers, and so on. To demonstrate, we use a copier instance directly, so that we can take advantage of the map it computes from originals to copies. We create a copier, copy the document root, and then ensure that all the cross references are initialized based on the available copies.

  final EcoreUtil.Copier copier = 
  new EcoreUtil.Copier();
  final DocumentRoot documentRootCopy = 
    (DocumentRoot)copier.copy(documentRoot);
  copier.copyReferences();
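
To make the two-pass idea concrete, here is a minimal sketch in plain Java. It is not EMF's EcoreUtil.Copier: the Node class, the CopierSketch class, and the choice to fall back to the original target when no copy exists are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a hand-written analogue of a two-pass copier, not EMF's implementation.
class CopierSketch {

    // Hypothetical stand-in for a tree node with contained children and cross references.
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();   // containment
        final List<Node> references = new ArrayList<>(); // non-containment cross references
        Node(String label) { this.label = label; }
    }

    // Map from each original to its copy, keyed by object identity.
    final Map<Node, Node> copies = new IdentityHashMap<>();

    // Pass 1: copy the containment tree and record every original-to-copy pair.
    Node copy(Node original) {
        Node copy = new Node(original.label);
        copies.put(original, copy);
        for (Node child : original.children) {
            copy.children.add(copy(child));
        }
        return copy;
    }

    // Pass 2: once all copies exist, point each copied reference at the copied target.
    void copyReferences() {
        for (Map.Entry<Node, Node> entry : copies.entrySet()) {
            for (Node target : entry.getKey().references) {
                Node copiedTarget = copies.get(target);
                // Assumption of this sketch: keep the original target if it was not copied.
                entry.getValue().references.add(copiedTarget != null ? copiedTarget : target);
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node child = new Node("text");
        root.children.add(child);
        child.references.add(root);

        CopierSketch copier = new CopierSketch();
        Node rootCopy = copier.copy(root);
        copier.copyReferences();
        System.out.println(rootCopy.children.get(0).references.get(0) == rootCopy);
    }
}
```

Cross references are handled in a second pass because a reference may point at an object that has not been copied yet when the referencing object is visited.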
  

We traverse the tree as before, but now we can assume the nodes are TreeNode instances. During this traversal, we make each copied node refer back to the original node from which it was copied.

  // Print each node, then make its copy refer back to the original node.
  new Object()
  {
   public void traverse(String indent, TreeNode treeNode)
   {
    System.out.println
      (indent + "{" + 
   extendedMetaData.getNamespace(treeNode.eContainmentFeature()) + 
   "}" + 
                   extendedMetaData.getName(treeNode.eContainmentFeature()));
    System.out.println(indent + " label=" + 
 treeNode.getLabel());
    for (Iterator i = treeNode.getReferences().iterator(); i.hasNext(); )
    {
     TreeNode referencedTreeNode = (TreeNode)i.next();
     System.out.println(indent + " reference=#" 
  + referencedTreeNode.getLabel());
    }
    
    TreeNode treeNodeCopy = (TreeNode)copier.get(treeNode);
    treeNodeCopy.getReferences().add(treeNode);
    
    FeatureMap featureMap = treeNode.getMixed();
    for (int i = 0, size = featureMap.size(); i < size; ++i)
    {
     EStructuralFeature feature = featureMap.getEStructuralFeature(i);
     if (FeatureMapUtil.isText(feature))
     {
      System.out.println(indent + " '" +
   featureMap.getValue(i).toString().replaceAll
   ("\n", "\\\\n") + "'");
     }
     else if (FeatureMapUtil.isComment(feature))
     {
      System.out.println(indent + 
   " <!--" + featureMap.getValue(i) + "-->");
     }
     else if (FeatureMapUtil.isCDATA(feature))
     {
      System.out.println(indent + " <![CDATA[" + 
   featureMap.getValue(i) + "]]>");
     }
     else if (feature instanceof EReference)
     {
      traverse(indent + " ", (TreeNode)featureMap.getValue(i));
     }
    }
   }
  }.traverse("", rootTreeNode);

Saving the copy produces the following:

  <?xml version="1.0" encoding="UTF-8"?>
  <tree:rootNode xmlns:tree="http://www.eclipse.org/emf/example/dom/Tree" 
  label="root"
    references="#root TreeNodeWithReferences.xml#root">
   <tree:childNode label="text" 
     references="#root TreeNodeWithReferences.xml#text">text</tree:childNode>
   <tree:childNode label="comment" 
     references="#root TreeNodeWithReferences.xml#comment">
  <!--comment--></tree:childNode>
   <tree:childNode label="cdata" 
     references="#root TreeNodeWithReferences.xml#cdata">
  <![CDATA[<cdata>]]></tree:childNode>
  </tree:rootNode>

To support this multi-resource model, when parsing a resource, EMF creates a proxy placeholder for each cross-document URI reference. That placeholder is resolved on demand when it is fetched later during data traversal, much as an HTML link is resolved to load a new document when you click on it.
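
The on-demand behavior can be sketched in plain Java as follows. This is an illustration only, not EMF's proxy mechanism: the Documents and Proxy classes are hypothetical, objects are represented by simple strings, and the loader function stands in for parsing a file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: a hand-written analogue of lazily resolved cross-document references.
class ProxySketch {

    // Hypothetical stand-in for a resource set: loads each document at most once.
    static class Documents {
        private final Map<String, Map<String, String>> loaded = new HashMap<>();
        private final Function<String, Map<String, String>> loader;
        int loadCount = 0;

        Documents(Function<String, Map<String, String>> loader) {
            this.loader = loader;
        }

        String getObject(String document, String fragment) {
            Map<String, String> contents = loaded.get(document);
            if (contents == null) {
                contents = loader.apply(document); // demand load on first access
                loaded.put(document, contents);
                loadCount++;
            }
            return contents.get(fragment);
        }
    }

    // Placeholder created at parse time; it remembers only the URI until first use.
    static class Proxy {
        private final String document;
        private final String fragment;
        private String resolved;

        // Assumes the URI has the form "document#fragment".
        Proxy(String uri) {
            int hash = uri.indexOf('#');
            document = uri.substring(0, hash);
            fragment = uri.substring(hash + 1);
        }

        String get(Documents documents) {
            if (resolved == null) {
                resolved = documents.getObject(document, fragment);
            }
            return resolved;
        }
    }

    public static void main(String[] args) {
        Documents documents = new Documents(name -> {
            Map<String, String> contents = new HashMap<>();
            contents.put("root", "root node of " + name);
            return contents;
        });
        Proxy proxy = new Proxy("TreeNodeWithReferences.xml#root");
        System.out.println(documents.loadCount);  // nothing is loaded until the proxy is used
        System.out.println(proxy.get(documents));
        System.out.println(documents.loadCount);
    }
}
```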

Reflecting on the landscape

We've really only touched the tip of the iceberg in terms of all the capabilities of EMF.

The primary goal of EMF's design has been to provide a powerful and flexible, yet simple and efficient, high-level, abstract data model. EMF's application to XML binding has arisen as a natural evolutionary consequence of a powerful representation's ability to assimilate other data.

Ecore is simpler than XML Schema and yet more powerful in important ways, e.g., typed references and multiple inheritance support. EObject too is simpler than DOM and yet more powerful in very important ways, e.g., type-safe efficient reflective access, so it should come as no surprise that the XML binding problem can be subsumed.

Reflecting back on the general nature of the problem, it is clear that although there may be many ways in which XML can be bound to Java, any binding for the full XML infoset will look much like DOM and that any binding for XML Schema will look much like the generated bean-like API pattern we've outlined.

There is in fact little significant difference between the various XML binding solutions. Most of the differences between them arise because XML Schema brings many complex problems into the picture, necessitating the introduction of undesirable complexities, like FeatureMap. Each complete binding solution surfaces the infoset's complexity in a different way. For example, our sample schema (modified to replace anyURI with IDREF, since JAXB doesn't support cross-document references) produces this API in JAXB 2.0:

 public class TreeNode
 {
   public List<Serializable> getContent();
   public String getLabel();
   public void setLabel(String value);
   public List<Object> getReferences();
 }

With JAXB, the need to support mixed content results in the loss of the access method for child nodes. Instead there is a single general content access method. This content list can contain strings, representing the mixed text, and JAXBElement instances, which act as value wrappers. It's easy to draw a parallel between JAXBElement and EMF's FeatureMap.Entry, but if we remove mixed content from the picture, the generated APIs for EMF and JAXB are effectively identical.
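
To illustrate what a single general content list means for client code, here is a minimal sketch in plain Java. It does not use JAXB or EMF: ElementEntry is a hypothetical stand-in for a wrapper such as JAXBElement or FeatureMap.Entry, and elementValues is a made-up helper that recovers the child values a dedicated accessor would otherwise return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: models a mixed content list of strings and element wrappers.
class MixedContentSketch {

    // Hypothetical wrapper pairing an element name with its value.
    static class ElementEntry {
        final String name;
        final Object value;

        ElementEntry(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    // Recover the values of all elements with the given name, skipping the text items.
    static List<Object> elementValues(List<Object> content, String name) {
        List<Object> result = new ArrayList<>();
        for (Object item : content) {
            if (item instanceof ElementEntry) {
                ElementEntry entry = (ElementEntry) item;
                if (entry.name.equals(name)) {
                    result.add(entry.value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Object> content = Arrays.asList(
            "\n ", new ElementEntry("childNode", "text node"),
            "\n ", new ElementEntry("childNode", "comment node"),
            "\n");
        System.out.println(elementValues(content, "childNode").size());
    }
}
```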

The problem of binding XML to Java remains a problem because XML Schema's complexities inevitably yield suboptimal bindings in all general binding solutions. Only a shift in focus away from concrete syntax and toward abstract syntax will yield optimal abstract bindings. APIs should focus on high-level abstractions, while the details of various possible concrete serializations should remain subordinate. Models are the fundamental underlying unity that brings order to this picture; everything can be modeled and is a model.

We've explored how XML Schema and Ecore parallel each other. The fundamental distinction between abstract syntax and concrete syntax becomes clear when you consider three models: XML Schema's schema for schemas describes XML Schema's concrete syntax, Ecore's Ecore model describes Ecore's abstract syntax, and it takes Ecore's XSD model to describe XML Schema's abstract syntax. In other words, while XML Schema helps describe what is valid, Ecore helps describe what it means.

In the universe of all models, the simplest self-describing model plays a unique role as the one model that binds all other models.

About the Authors

Ed Merks is a co-lead of the top-level Eclipse Modeling project as well as the lead of the Eclipse Modeling Framework project. He has many years of in-depth experience in the design and implementation of languages, frameworks and application development environments. He holds a Ph.D. in computing science and is a co-author of the authoritative "Eclipse Modeling Framework, A Developer's Guide" (Addison-Wesley 2003). He works for IBM Rational at the Toronto Lab.

Elena Litani is a software developer working for IBM. She is one of the main contributors to the Eclipse Modeling Framework (EMF) project at Eclipse.org, working on the implementation of EMF and Service Data Objects (SDO). Previously, Elena was one of the main contributors to the Apache Xerces2 project, working on the Xerces2 XML Schema and DOM Level 3 implementations, as well as analyzing and improving the performance of the parser.