Version 1.0 of VTD-XML -- an open-source, high-performance, non-extractive XML processing API -- is freely available in Java and C on sourceforge.net, with source code, documentation, a detailed description of the API, and code examples. VTD-XML is geared toward very fast examination of XML held in an in-memory buffer.
New in VTD-XML 1.0 is integrated XPath support with an easy-to-use interface that further enhances VTD-XML's inherent benefits, such as CPU/memory efficiency, random access, and incremental update. Demos are available at http://www.ximpleware.com/demo.html.
For further reading, please refer to the following articles about VTD-XML:
-
VTD-XML 1.0, another XML API, released under GPL (20 messages)
- Posted by: Jimmy Zhang
- Posted on: November 29 2005 14:12 EST
-
Limits
- Posted by: Kit Davies
- Posted on: November 30 2005 10:33 EST
- in response to Jimmy Zhang
Though for 95% of applications I'm sure this would be extremely useful, I could see that some may have problems with the limits for starting offset and depth. Maybe dock some bits from QName length (511 + 1023 is a lot!) and add to others?
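To make the bit-budget trade-off concrete, here is a minimal sketch of packing a token record into a 64-bit long. The field widths and the names `TokenRecordSketch`, `pack`, `field`, and `maxValue` are assumptions chosen for illustration, not the layout defined by the VTD-XML specification. The point is only that each bit moved from a name-length field to the offset or depth field doubles that field's limit.

```java
// Hypothetical 64-bit token record. Field widths are illustrative assumptions,
// not the layout defined by the VTD-XML specification.
final class TokenRecordSketch {
    // Field widths in bits (assumed).
    static final int OFFSET_BITS = 30;
    static final int QNAME_BITS = 11;
    static final int PREFIX_BITS = 9;
    static final int DEPTH_BITS = 8;
    static final int TYPE_BITS = 4;

    // Position of each field, counted from the least significant bit.
    static final int OFFSET_SHIFT = 0;
    static final int QNAME_SHIFT = OFFSET_SHIFT + OFFSET_BITS;
    static final int PREFIX_SHIFT = QNAME_SHIFT + QNAME_BITS;
    static final int DEPTH_SHIFT = PREFIX_SHIFT + PREFIX_BITS;
    static final int TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS;

    /** Largest value that fits in a field of the given width. */
    static long maxValue(int bits) {
        return (1L << bits) - 1;
    }

    /** ORs one range-checked field into the record. */
    private static long put(long record, long value, int shift, int bits, String name) {
        if (value < 0 || value > maxValue(bits)) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
        return record | (value << shift);
    }

    /** Packs all fields into a single long; no per-token object is allocated. */
    static long pack(int type, int depth, int prefixLen, int qnameLen, int offset) {
        long record = 0;
        record = put(record, type, TYPE_SHIFT, TYPE_BITS, "type");
        record = put(record, depth, DEPTH_SHIFT, DEPTH_BITS, "depth");
        record = put(record, prefixLen, PREFIX_SHIFT, PREFIX_BITS, "prefix length");
        record = put(record, qnameLen, QNAME_SHIFT, QNAME_BITS, "qname length");
        record = put(record, offset, OFFSET_SHIFT, OFFSET_BITS, "offset");
        return record;
    }

    /** Reads one field back out of a packed record. */
    static int field(long record, int shift, int bits) {
        return (int) ((record >>> shift) & maxValue(bits));
    }

    public static void main(String[] args) {
        long record = pack(2, 3, 0, 5, 1000);
        System.out.println("offset = " + field(record, OFFSET_SHIFT, OFFSET_BITS));
        System.out.println("depth  = " + field(record, DEPTH_SHIFT, DEPTH_BITS));
        // Moving two bits from the name-length fields to the offset field
        // would quadruple the largest addressable offset.
        System.out.println("max offset, 30 bits = " + maxValue(30));
        System.out.println("max offset, 32 bits = " + maxValue(32));
    }
}
```

With these assumed widths, a document larger than `maxValue(OFFSET_BITS)` bytes or nested deeper than `maxValue(DEPTH_BITS)` levels cannot be represented, which is the kind of limit in question.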
But I certainly don't want to detract too much from a very interesting project. Do I gather it is being used in XML hardware?
Kit -
Limits
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 13:17 EST
- in response to Kit Davies
Yes, moving bits around is certainly going to improve things. The goal of VTD-XML is to first make things useful, then improve upon that.
jz -
Interesting idea
- Posted by: peter lin
- Posted on: November 30 2005 13:14 EST
- in response to Jimmy Zhang
I took a quick glance at the documentation and it looks interesting. I had similar ideas in the past, but was way too lazy to actually write something. Anyone with real world experience with it?
peter -
Interesting idea
- Posted by: Brian Miller
- Posted on: December 01 2005 02:00 EST
- in response to peter lin
I had similar ideas in the past, but was way too lazy to actually write something.
I assume you're referring to the "non-extractive tokenization approach that maintains the source document intact in memory ... a cursor-based API that retains most of DOM's random-access capabilities at a fraction of its memory usage". It's the phrase in bold that I don't like. What DOM features are lost? Why ignore that Moore's Law increases available memory and also speeds heap processing? Is there a compelling application unable to succeed due to SAX's linearity and JAXP's DOM being such a pig? Would XSLTC benefit from it? -
Interesting idea
- Posted by: Kit Davies
- Posted on: December 01 2005 04:48 EST
- in response to Brian Miller
Is there a compelling application unable to succeed due to SAX's linearity and JAXP's DOM being such a pig? Would XSLTC benefit from it?
If you go to the use cases for Binary XML, there are several there that cover random access into very large documents. I could see VTD-XML as being useful in these sorts of cases.
AIUI, VTD-XML is more of a document indexer than parser. -
Interesting idea
- Posted by: Jimmy Zhang
- Posted on: December 01 2005 13:20 EST
- in response to Kit Davies
Kit, it depends on what the definition of "parser" is. In my view it is a parser first and an indexer second. By my definition, a parser prepares the document into a form that applications can consume (and then do whatever they want with it).
As I point out in the "XML on a chip" article, XML's performance issue is really a problem of XML's processing models, which has little to do with XML itself, so replacing XML with a binary version is probably not solving the right problem... -
Any comparison to stream aka pull parsers?
- Posted by: peter lin
- Posted on: December 01 2005 13:25 EST
- in response to Jimmy Zhang
Has anyone done a comparison to XML stream parsers or XPP3? Just curious.
peter -
Interesting idea
- Posted by: Brian Miller
- Posted on: December 02 2005 12:06 EST
- in response to Jimmy Zhang
Kit, it depends on what the definition of "parser" is. In my view it is a parser first, indexer second, by parser my definition is to prepare the document into a form that applications can consume...
Isn't VTD-XML less like DOM and more like SAX with nonlinearity (the ability to jump around and tokenize unvisited document fragments in any order, not only document order)? What additional services does VTD-XML provide that SAX doesn't? What can DOM do that VTD-XML doesn't? -
Interesting idea
- Posted by: Jimmy Zhang
- Posted on: December 02 2005 13:28 EST
- in response to Brian Miller
Try the demo at http://www.ximpleware.com/demo.html and a lot of your questions can be answered... -
Interesting idea
- Posted by: Jimmy Zhang
- Posted on: December 01 2005 13:24 EST
- in response to Brian Miller
VTD-XML is simpler to use than DOM: in DOM you have to do a lot of node casting; in VTD-XML you don't. Still, VTD-XML is not DOM...
Also, the write feature of VTD-XML is different: DOM modifies the data structure, while VTD-XML modifies the XML directly. -
VTD-XML 1.0, another XML API, released under GPL
- Posted by: Anick Thistle
- Posted on: November 30 2005 15:54 EST
- in response to Jimmy Zhang
Looks okay. Still not as simple (and probably not as fast) as my personal favorite -- http://simple-software.ca/flat_dom.jsp -
flat DOM
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 16:05 EST
- in response to Anick Thistle
Looks interesting... Unfortunately, once you go down the path of SAX, object allocation cost becomes inevitable, and performance is going to suffer...
-
flat DOM
- Posted by: Anick Thistle
- Posted on: November 30 2005 16:09 EST
- in response to Jimmy Zhang
Really? I thought SAX was about as fast as you can get? But I haven't benchmarked it or anything -- just my impression versus DOM. I wrote my own 'dumb' parser once which was a dead stupid version and I was blown away that SAX was faster (twice as fast). I was muttering to myself "how could that be .." ;-)
Cheers. -
flat DOM
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 18:38 EST
- in response to Anick Thistle
VTD-XML is up to twice as fast as SAX with a NULL content handler. In other words, VTD-XML is still much faster than a dry run of a SAX parser with no custom logic or program code... Its C version significantly beats Expat's C version almost every time...
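For anyone who wants to reproduce the SAX side of that comparison, a "dry run with a NULL content handler" can be sketched with the JDK's own SAX parser and a do-nothing `DefaultHandler`. This is only a rough timing sketch under stated assumptions: the class name `SaxDryRun`, the generated sample document, and the single warm-up pass are all illustrative, and a serious benchmark would need many iterations and realistic input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class SaxDryRun {
    /** Parses the buffer with a do-nothing handler and returns elapsed nanoseconds. */
    static long parseNanos(byte[] xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            long start = System.nanoTime();
            // DefaultHandler ignores every event, so only parsing cost is measured.
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new IllegalStateException("SAX dry run failed", e);
        }
    }

    /** Builds a small synthetic document (hypothetical test data). */
    static byte[] sampleDocument(int items) {
        StringBuilder sb = new StringBuilder("<orders>");
        for (int i = 0; i < items; i++) {
            sb.append("<order id=\"").append(i).append("\"><qty>")
              .append(i % 7).append("</qty></order>");
        }
        return sb.append("</orders>").toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] xml = sampleDocument(100_000);
        parseNanos(xml); // one warm-up pass so the JIT has seen the code
        long nanos = parseNanos(xml);
        double mbPerSec = (xml.length / 1e6) / (nanos / 1e9);
        System.out.printf("%d bytes in %.1f ms (%.1f MB/s)%n",
                xml.length, nanos / 1e6, mbPerSec);
    }
}
```

The number this prints is a baseline for the parser alone; any handler logic or data extraction adds to it.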
-
flat DOM
- Posted by: Anick Thistle
- Posted on: November 30 2005 18:50 EST
- in response to Jimmy Zhang
VTD-XML is up to twice as fast as SAX with NULL content handler, what this means is that just do a dry run with SAX parser, without any custom logic or program code, VTD-XML is still much faster... Its C version significantly beats Expat's C version almost every time ...
Excellent! Congratulations, as that seems like quite a feat to me. However, I wonder whether you would come out much ahead if you were extracting most of the data anyway? I looked at your benchmarks and you were faster by 20 to 30%, but I didn't study the examples closely, i.e. how much data is being extracted, etc. Your example of being twice as fast with a NULL handler would seem to be the ideal scenario for VTD-XML (assuming it does not extract any data either).
Anyways it sounds like a good idea with potential. Good luck. -
flat DOM
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 18:59 EST
- in response to Anick Thistle
The benchmark is very old; the performance of VTD-XML has since improved quite a bit. Our estimate is 500 MB/sec on a 3 GHz Pentium processor... The entire idea of VTD-XML is that extracting data, in most cases, doesn't accomplish much. In other words, you simply don't have to... Offset and length are all you need... -
Extraction still required?
- Posted by: Andy Grove
- Posted on: November 30 2005 17:19 EST
- in response to Jimmy Zhang
Hi Jimmy,
I've read the documentation (briefly) and this looks like an interesting idea but I have a question. I understand that the parser does not extract the XML data into a separate DOM structure but essentially uses indexes into the original document to reduce memory overhead. But surely the whole point of parsing the XML document in most cases is to enable the data in that document to be extracted into Java variables so that they can be processed in some way. Even with this fast approach to parsing there will still be the same overhead of creating the Java objects and populating them after the document has been parsed? I can see that there is a performance benefit of this approach if only a subset of the document is to be extracted. Have I understood this correctly?
Thanks,
Andy. -
Extraction still required?
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 18:35 EST
- in response to Andy Grove
Andy, there are, roughly speaking, two ways to work with XML. One is "data centric", which converts XML into Java objects; some call it data binding. The other is "document centric", which basically treats XML as a message.
The Web Services/SOA community has generally considered the document-centric view of XML the right way, as it takes full advantage of the loose coupling of XML. In this view, XML documents are *not* restricted by schema, so applications are less likely to break as the data format evolves.
The data-centric view of XML, on the other hand, is not designed to take those advantages. At the same time, the data-centric view, which requires creating a large number of objects, is also very slow.
With VTD-XML, message-based, document-centric processing not only takes full advantage of XML's loose coupling, but is also much faster than data-centric XML data binding.
Check out the following article on this topic:
http://www.fawcette.com/xmlmag/2002_04/magazine/departments/endtag/
-
Extraction still required?
- Posted by: Andy Grove
- Posted on: November 30 2005 19:10 EST
- in response to Jimmy Zhang
I understand the different models and I probably didn't state my original question in enough detail.
If I need to pull the data out of the XML document (which presumably is the point of parsing it in the first place) then the API will still need to extract this data and place it into newly allocated Java variables (Strings or primitives). I wasn't referring to binding to a Java object model ala Castor or XMLBeans.
My point is that post-parsing I will still need to extract the data from the XML document in order to do something with it and that is the point where the usual overhead of memory allocation and data copying will occur. I can see that even with this overhead there are some potential advantages over the traditional DOM model in certain situations (with the VTD approach it is more likely that transient objects will be created that can be garbage collected after they have been processed and if only a subset of the document needs processing then the memory overhead will be lower than the classic DOM approach).
I'm just concerned that the VTD benchmarks are not reflective of real world use cases where data is actually extracted from the XML document for processing.
Cheers,
Andy. -
Extraction still required?
- Posted by: Jimmy Zhang
- Posted on: November 30 2005 19:30 EST
- in response to Andy Grove
Andy, at that point the performance impact is also very small, for several reasons:
1. Because you already know the offset and length, you can allocate a string and fill in the chars fairly easily.
2. You navigate the document by element/attribute names and then extract the value/text node, so a lot of unnecessary extractions (of element/attribute names) are avoided.
3. If you want to extract the int/float value of a field, you can convert the VTD record to int/float directly, which bypasses string creation.
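As a rough illustration of points 1 and 3, the sketch below works on a raw byte buffer plus an (offset, length) pair. It does not use the VTD-XML API; the class `OffsetLengthSketch` and its methods are hypothetical names, and the sample document is assumed to be ASCII so that character and byte positions coincide.

```java
import java.nio.charset.StandardCharsets;

final class OffsetLengthSketch {
    /** Copies a token out of the buffer only when a String is actually needed. */
    static String tokenToString(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    /** Parses a decimal int straight from the buffer, with no intermediate String. */
    static int tokenToInt(byte[] doc, int offset, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty token");
        }
        int i = offset;
        int end = offset + length;
        boolean negative = doc[i] == '-';
        if (negative || doc[i] == '+') {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("sign without digits");
        }
        long value = 0;
        for (; i < end; i++) {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at byte " + i);
            }
            value = value * 10 + digit;
            // Stop early so the long accumulator itself can never overflow.
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("int overflow");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("int overflow");
        }
        return (int) value;
    }

    public static void main(String[] args) {
        String text = "<order><qty>42</qty></order>";
        byte[] doc = text.getBytes(StandardCharsets.UTF_8);
        // In a real parser the offset and length would come from the token record;
        // here they are located by hand for the example.
        int offset = text.indexOf("42");
        int length = 2;
        System.out.println(tokenToString(doc, offset, length));
        System.out.println(tokenToInt(doc, offset, length) + 1);
    }
}
```

The String path allocates one object per extracted value, while the int path allocates nothing, which is why numeric fields are the cheapest case.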