Dynamic translation of XML into CSV using XSD information

Posted by: Johnny Bravo on January 11, 2007
Hi,

I have a web application that needs to accept an XSD and a corresponding XML as input via form parameters.

I need to parse the XSD, identify the simple and complex types along with other necessary information, and represent this in a structural format (preferably a tree hierarchy).
This structural information will then be used to help parse the associated XML file, record by record, into a CSV file.
The XSD will use only a limited set of XSD constructs (namespaces and imports can be ignored for the time being).

What is the best way that I can go forward with this?

I have considered these options.
XSD-Java binding tools (XMLBeans, JAXB).
I can get a type hierarchy using these tools.
countries
.country
..id
..name
..states
...state
....id
....name

This would be the ideal scenario: I could create a new instance hierarchy based on the above type hierarchy (acting as an intermediate in-memory representation), populate the simple types (or attributes) with their respective values as I parse through the XML, and write each record to the file. However, this option cannot be considered because of the number of class files it would generate: for each XSD uploaded, the system would have to generate a new set of class files, and this is not acceptable.

XSOM
Using XSOM, I can create a simple tree structure with two user-defined types, 'ComplexType' and 'SimpleType'.

ComplexType
Name
Set of simple types
Set of complex types

SimpleType
Name
Value

The problem here is I am not sure how I can go about parsing the XML and have an intermediate representation which could be committed to the CSV file.
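The tree described above could be sketched roughly as follows. This is only an illustration, not XSOM's API: `SchemaNode` is a hypothetical class, and it folds 'ComplexType' and 'SimpleType' into a single node type (a node without children stands for a simple type). The tree is built by hand here; in practice it would be filled in while walking the parsed XSD.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory schema tree: a node with no children stands for a simple type. */
class SchemaNode {
    final String name;
    final List<SchemaNode> children = new ArrayList<>();

    SchemaNode(String name) {
        this.name = name;
    }

    /** Adds child nodes and returns this node, so a tree can be built inline. */
    SchemaNode add(SchemaNode... nodes) {
        for (SchemaNode n : nodes) {
            children.add(n);
        }
        return this;
    }

    boolean isSimple() {
        return children.isEmpty();
    }

    /** Depth-first walk collecting one underscore-joined path per simple type. */
    List<String> columnNames() {
        List<String> out = new ArrayList<>();
        collect("", out);
        return out;
    }

    private void collect(String prefix, List<String> out) {
        String path = prefix.isEmpty() ? name : prefix + "_" + name;
        if (isSimple()) {
            out.add(path);
        } else {
            for (SchemaNode child : children) {
                child.collect(path, out);
            }
        }
    }
}
```

Calling `columnNames()` on the root of such a tree gives the paths of the simple types in document order, which is one possible starting point for a CSV header.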

DOM is not an option at all due to memory constraints.

Example:
XML

<countries>
  <country>
    <id>1</id>
    <name>Country1</name>
    <states>
      <state>
        <id>1</id>
        <name>State1</name>
      </state>
      <state>
        <id>2</id>
        <name>State2</name>
      </state>
    </states>
  </country>
  <country>
    <id>2</id>
    <name>Country2</name>
    <states>
      <state>
        <id>3</id>
        <name>State3</name>
      </state>
    </states>
  </country>
</countries>

CSV

countries_country_id,countries_country_name,countries_country_states_state_id,countries_country_states_state_name
1,Country1,1,State1
1,Country1,2,State2
2,Country2,3,State3


What I have in mind for the intermediate structure to represent a single record (one line) in the CSV file is an array of key-value pairs. I could parse through the XSD, identify all the simple types (along with their position relative to the root) and initialize the array keys with the simple type names.
e.g.:
[(countries_country_id=''),(countries_country_name=''),(countries_country_states_state_id=''),(countries_country_states_state_name='')]

Further on, I could parse the XML (using SAX/StAX) and populate the above array one simple type value at a time. I could make use of a stack to maintain the state information by pushing the element names. Once I encounter a type that is already present in the stack, it will identify the end of a line (or record) in the CSV file. The contents of the array will then be written to the CSV file, the array will be reset (sanity check), and the parsing process will continue.
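A single-pass StAX version of this idea might look like the sketch below. It is written under assumptions not stated in the post: the column paths are already known, the element that owns the last column (here `state`) marks the end of a record, and values never need CSV quoting. The class name `StreamingCsvSketch` and the rule for detecting the end of a record are illustrative; the post's own rule (a name already on the stack) is replaced by a simpler "record element closed" check.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingCsvSketch {

    /** Streams the XML once and returns header plus data rows as CSV text (values are not quoted). */
    static String toCsv(String xml, List<String> columns) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            index.put(columns.get(i), i);
        }
        // Assumption: the element that owns the last column marks the end of a record.
        String last = columns.get(columns.size() - 1);
        String recordPath = last.substring(0, last.lastIndexOf('_'));

        String[] row = new String[columns.size()];
        Arrays.fill(row, "");
        Deque<String> paths = new ArrayDeque<>(); // full path of every open element, innermost on top
        StringBuilder text = new StringBuilder();
        StringBuilder csv = new StringBuilder(String.join(",", columns)).append('\n');
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    paths.push(paths.isEmpty() ? name : paths.peek() + "_" + name);
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText()); // text may arrive in several chunks
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String path = paths.pop();
                    Integer col = index.get(path);
                    if (col != null) {
                        row[col] = text.toString().trim();
                    }
                    if (path.equals(recordPath)) {
                        csv.append(String.join(",", row)).append('\n');
                    }
                    // Reset every column below the closed element so stale values never leak.
                    for (int i = 0; i < row.length; i++) {
                        if (columns.get(i).startsWith(path + "_")) {
                            row[i] = "";
                        }
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return csv.toString();
    }
}
```

For a real 100 MB file the reader would be created from an `InputStream` and rows written straight to a `Writer` instead of being collected in a `StringBuilder`; the string-based version only keeps the sketch short. Memory use is then bounded by the nesting depth and the number of columns rather than by the document size.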

I was also thinking of replacing the array of key-value pairs with a type that extends HashMap, but I really do not need the flexibility a HashMap offers. Also, the ordering of elements might become an issue and I would have to move over to a TreeMap (performance hit?). I feel that in the case of a deeply nested XML, hashing will provide a considerable performance improvement. So should I implement my own custom hashtable-based fixed-size array?
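One way to get both hashed lookup and a fixed column order without a custom hashtable is to keep the values in a plain array and use an ordinary `HashMap` only to map each column path to its array position. The `CsvRecord` class below is a hypothetical sketch of that idea, not a recommendation from the thread. (If a map is preferred, `java.util.LinkedHashMap` also preserves insertion order, so a `TreeMap` is not needed just for ordering.)

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fixed-width record: column order lives in the array, lookup in the map. */
class CsvRecord {
    private final Map<String, Integer> index = new HashMap<>();
    private final String[] values;

    CsvRecord(List<String> columns) {
        for (int i = 0; i < columns.size(); i++) {
            index.put(columns.get(i), i);
        }
        values = new String[columns.size()];
        reset();
    }

    /** Stores a value; returns false if the path is not a known column. */
    boolean set(String column, String value) {
        Integer i = index.get(column);
        if (i == null) {
            return false;
        }
        values[i] = value;
        return true;
    }

    /** Clears all values so the same instance can be reused for the next row. */
    void reset() {
        Arrays.fill(values, "");
    }

    /** Joins the values in column order (no CSV quoting in this sketch). */
    String toCsvLine() {
        return String.join(",", values);
    }
}
```

The index map is built once per schema, and a single `CsvRecord` instance can be reused for every row, so the per-row cost is a few hash lookups and one join.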

Awaiting your feedback and comments. I have been burning my head for the past few days trying to figure out the most efficient and scalable way of doing this. If you could provide me with any better alternative approaches, it would be really helpful.

Many Thanks,
.J.
Message #225298

How much scalability and speed do you actually need?

Posted by: Anthony Coates on January 12, 2007 in response to Message #225295
You are certainly right that binding tools (JAXB, etc.) aren't right for this kind of problem. You need to be working with an API that handles arbitrary XML. From what you have written, you could choose to ignore the Schema, since all of the information that you need will be in the XML document itself.

While I understand that you want "the most efficient and scalable way of doing this", most real development problems have to find a balance between computational efficiency, scalability, and the time and money required to code/debug/optimise the solution. For that reason, I don't recommend that you try to make the solution any faster or more scalable than it really needs to be, not if you can save some development/debugging time/cost by being less ambitious. Rarely is the XML processing the slowest point in any complete business process, at least in my experience.

From a programming point of view, a "DOM-like" API is probably the easiest for this. If the DOM uses too much memory for your application (and you said that it does), then look at XOM or VTD-XML. If these also use too much memory for your application, then you are correct that your only remaining choice (but not such a bad one for this problem) is to use SAX or StAX and maintain your own stack. I've certainly written this kind of code before (it's similar to what Excel does when it imports an XML document), and it's not too difficult.

If you use SAX or StAX (and if you don't use the Schema), you will have to parse each document twice, once to get the structure so that you can generate the columns and the mapping from the XML paths to the columns, and a second time to generate the data rows.
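As a rough sketch, the first of the two passes (discovering the columns without a Schema) could look like this with StAX. `ColumnDiscovery` is a hypothetical name, and the sketch assumes that every element without child elements should become a column, named by its underscore-joined path as in the original post's CSV header.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ColumnDiscovery {

    /** First pass: returns every distinct leaf-element path in order of first appearance. */
    static List<String> discover(String xml) {
        Set<String> columns = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        Deque<String> paths = new ArrayDeque<>();
        boolean leaf = false; // true while the innermost open element has had no child element
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    paths.push(paths.isEmpty() ? name : paths.peek() + "_" + name);
                    leaf = true;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String path = paths.pop();
                    if (leaf) {
                        columns.add(path);
                    }
                    leaf = false; // the parent now has at least one child, so it is not a leaf
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return new ArrayList<>(columns);
    }
}
```

Note that an empty container such as `<states/>` would be reported as a column by this sketch, which is one reason the Schema can be the more reliable source of structure.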

If the files really are so big that parsing them twice is a problem, then the alternative is to process the Schema first to get the structure. I often use XSLT for that kind of thing, but you could use XOM if you prefer to write Java, or you could try the Eclipse API for Schemas (the "org.eclipse.xsd" packages). Once you have the structure information, you can then use SAX or StAX to process the XML file just once and generate the data rows.
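For a heavily restricted Schema, the structure can even be read with plain StAX instead of XSLT, XOM or the Eclipse XSD API mentioned above. The sketch below is a swapped-in shortcut, not any of those tools, and it rests on strong assumptions: every element is declared inline with a `name` attribute (no `ref`, no named or imported types, no attributes), so the nesting of `xs:element` declarations mirrors the nesting of the instance document. `XsdColumns` is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XsdColumns {
    private static final String XS_NS = "http://www.w3.org/2001/XMLSchema";

    /** Returns one underscore-joined path per innermost xs:element declaration. */
    static List<String> fromXsd(String xsd) {
        List<String> columns = new ArrayList<>();
        Deque<String> paths = new ArrayDeque<>();
        boolean leaf = false; // true while the innermost declaration has no nested declaration
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xsd));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && isElementDecl(reader)) {
                    // Assumption: inline declarations only, so "name" is always present.
                    String name = reader.getAttributeValue(null, "name");
                    paths.push(paths.isEmpty() ? name : paths.peek() + "_" + name);
                    leaf = true;
                } else if (event == XMLStreamConstants.END_ELEMENT && isElementDecl(reader)) {
                    String path = paths.pop();
                    if (leaf) {
                        columns.add(path);
                    }
                    leaf = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XSD", e);
        }
        return columns;
    }

    /** True if the current start or end tag is an xs:element in the XML Schema namespace. */
    private static boolean isElementDecl(XMLStreamReader reader) {
        return XS_NS.equals(reader.getNamespaceURI()) && "element".equals(reader.getLocalName());
    }
}
```

Anything beyond this subset (global types, `ref`, `xs:choice` with different record shapes, attributes) would need a real schema object model such as XSOM or the Eclipse XSD packages.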

Cheers, Tony.
--
Anthony B. Coates
Author, "XML APIs" chapter,
"Advanced XML Applications from the Experts at The XML Guild"
http://www.amazon.com/XML-Power-Comprehensive-Guide-Guides/dp/1598632140/

Message #225299

Re: Dynamic translation of XML into CSV using XSD information

Posted by: Johnny Bravo on January 12, 2007 in response to Message #225295
Thank you for your feedback Tony.

In some cases, the XML files are so huge (100 MB+) that a DOM-like API is out of the question. The same reason rules out parsing the XML twice, especially since the users of the product are willing to provide both the XSD and the XML.

I'll have a look at the Eclipse API for schemas. Do you have any idea on how it compares with XSOM? I have not found much documentation for XSOM other than the javadoc.

All Content Copyright ©2007 TheServerSide