IBM Devworks Article: Learn 10 good XML usage habits

  1. Learning good habits in XML can make all the difference between taking advantage of the functionality offered by XML and struggling against the XML standard to get the basics of validation and parsing right. In this IBM DevWorks article by Martin Brown, discover 10 good habits that improve your effectiveness and efficiency as you work with XML documents and data.
    Here are the top 10 good XML habits to adopt:
    1. Define your XML and encoding
    2. Use a DTD or XSD
    3. Remember to validate
    4. Validation isn't always the answer
    5. XML structure versus attributes
    6. Use XPath to find information
    7. You don't always need a parser to extract information
    8. When to use SAX over DOM parsing
    9. When to use DOM over SAX parsing
    10. Use a good XML editor
    The XPath hint:
    When working with XML data, finding the information you want can be complex. You can, of course, write a parser to pick out the material that you need, but sometimes you really just need to find a small fragment of the information in the file very quickly. For example, if you wanted to extract a list of all the countries in your contacts XML file so that you could see how widely spread your contacts were, you could use XPath to pick out the information. XPath enables you to pull out the data from an XML file by using the structure of the XML file as part of the query. You can, for example, extract the data for a specific element by giving the path to the element within the XML file:
        $ xpath contacts.xml '//contact/address/country'
    You can dissect the content like this:
    • The initial double slash (//) specifies to look anywhere within the document for the specified element (contact).
    • The next slash and element name specify the next element to pick out (address); that is, look for the address element within the contact element.
    • The final one repeats the process, this time picking the country.
    Note that in the example, you did not qualify the type of address to select the information from, so it will pick all addresses. You can see the result of the XPath query:
        $ xpath contacts.xml '//contact/address/country'
        Found 3 nodes:
        -- NODE --
        USA
        -- NODE --
        USA
        -- NODE --
        USA
    If you want to pick out more specific data, you can specify the element contents or attribute contents that you want to match. For example, to select only mobile phone numbers, you need to specify the attribute type and value. To do this, use an at sign (@), which specifies that you want to search an attribute, and then specify the value you want to match:
        $ xpath contacts.xml '//contact/phone[@type="mobile"]'
        Found 1 nodes:
        -- NODE --
        123 456 7890
    These examples use a command-line tool. Many XML toolkits provide native methods to work with XPath expressions, and you can extract data using the XPath specification to use in your applications directly, without having to work with a parser to get the information.
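As a rough sketch of the "native methods" point, the same two queries can be run with the XPath subset built into Python's standard xml.etree.ElementTree module. The contents of contacts.xml are not shown in the article, so the sample document and the names in it are assumptions, made up to match the output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for contacts.xml; the real file is not shown.
CONTACTS_XML = """<contacts>
  <contact>
    <name>Alice</name>
    <address><country>USA</country></address>
    <phone type="home">555 000 1111</phone>
    <phone type="mobile">123 456 7890</phone>
  </contact>
  <contact>
    <name>Bob</name>
    <address><country>USA</country></address>
    <phone type="home">555 000 2222</phone>
  </contact>
  <contact>
    <name>Carol</name>
    <address><country>USA</country></address>
    <phone type="work">555 000 3333</phone>
  </contact>
</contacts>"""

root = ET.fromstring(CONTACTS_XML)

# ElementTree paths are relative, so '//contact' is written './/contact'.
countries = [node.text for node in root.findall(".//contact/address/country")]

# The [@type='mobile'] predicate filters on an attribute value.
mobiles = [node.text for node in root.findall(".//contact/phone[@type='mobile']")]

print(countries)
print(mobiles)
```

ElementTree only implements part of XPath; a full XPath engine would be needed for functions or axes beyond simple paths and predicates.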
    Any hints you'd offer in addition to these? One that I thought was missing was:
    1. Don't mix text data with child nodes - In other words, if you're going to add content as a child node to a given XML node, you shouldn't add more content as text, like this: <bar>baz</bar>Here's a text node, a sibling of the 'bar' node
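One way to see why mixed content is awkward is to look at how a tree API exposes it. The sketch below uses Python's xml.etree.ElementTree; the parent element name foo is an assumption added only so the fragment parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content fragment: <foo> holds a child element and loose text.
mixed = ET.fromstring(
    "<foo><bar>baz</bar>Here's a text node, a sibling of the 'bar' node</foo>"
)
bar = mixed.find("bar")

print(bar.text)    # text inside <bar>
print(bar.tail)    # loose text after </bar>, stored on the child, not the parent
print(mixed.text)  # None: there is no text before the first child
```

The loose text ends up attached to the preceding child as its tail, which is easy to overlook when walking the tree.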
  2. Also: 11. Prefer convention over configuration. 12. Use XML judiciously for configuration files. Convention over configuration requires a fair amount of thought. You have to see whether you really need the flexibility that configuration gives you. Most of the time (even if you are writing a public API or framework) you won't need all of the flexibility that comes with configuration.
  3. Some corrections I would make to the article:
    1) When your XML file has the correct syntax, such as a root element in place and begin and end tags where they belong, that is called being "WELL-FORMED". This has nothing to do with the XML file being VALID. Validity comes into play when you VALIDATE the XML file against a DTD or SCHEMA (XSD).
    2) "Validation isn't always the answer... In short, XML validation only ensures the structure is correct, not the data..." SCHEMA (XSD) CAN indeed provide very useful ways to ensure that your data is correct. Two quick and useful examples:
    • By simply specifying xsd:date as the data type of an element, you can make sure that only valid dates are entered.
    • You can even use regular expressions to specify proper element content, which is useful for specifying phone numbers, for example.
    3) A good XML editor is definitely key. OxygenXML is excellent, with support for DTD, Schemas, XSLT and even XQuery, but it's not free (the trial version is). A nice free editor is XML Buddy http://xmlbuddy.com/.
    - Kalman, Moderator for http://groups.yahoo.com/group/JHUXML/
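The well-formed versus valid distinction can be illustrated with a small sketch. Python's standard xml.etree.ElementTree parser checks well-formedness only and does not validate against a DTD or XSD, so a document with a nonsense date still parses. The element names below are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML; says nothing about validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Well-formed, although 2008-02-31 is not a real date: only validation
# against a schema (e.g. an xsd:date type) would reject this content.
good = "<contact><birthday>2008-02-31</birthday></contact>"

# Not well-formed: <birthday> is never closed.
bad = "<contact><birthday>2008-02-31</contact>"

print(is_well_formed(good))
print(is_well_formed(bad))
```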
  4. Specifying that an element is a date is one thing, but it is much more difficult to specify that the date is correct. For instance, defining in XSD that the trade settlement data is greater than the trade execution date, but by no more than 5 business days, is not trivial. I would argue that this type of logic is better validated elsewhere.
  5. Sorry for the typos

    "Trade Settlement Data" should be "Trade Settlement Date". My point is that I agree with the author: XML validation via XSD is not always the answer.
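For illustration, the settlement rule described above is straightforward to express in application code. This is a minimal sketch, assuming that "business days" means Monday to Friday with holidays ignored; the function names are hypothetical.

```python
from datetime import date, timedelta

def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end` (holidays ignored)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            days += 1
    return days

def settlement_is_valid(execution: date, settlement: date, max_days: int = 5) -> bool:
    """Settlement must be after execution, by at most `max_days` business days."""
    if settlement <= execution:
        return False
    return business_days_between(execution, settlement) <= max_days

execution = date(2008, 2, 1)  # a Friday
print(settlement_is_valid(execution, date(2008, 2, 8)))   # five business days later
print(settlement_is_valid(execution, date(2008, 2, 11)))  # six business days later
```

A real system would also need a holiday calendar, which is one more reason this kind of rule tends to live outside the schema.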
  6. Brit, you are correct that not EVERYTHING can be done in the schema. And the business condition you are talking about might indeed be better left for the application processing the XML to verify. BUT... why would you not take advantage of the vast validation techniques that schema DOES provide? Why would you wait until the application layer to suddenly realize that February 31, 2008 is not a valid date when simple validation against XSD can tell you that? - Kalman
  7. A lot of this depends on what you want to do with the bad data. If you use a validating parser, you will likely get some pretty ugly error messages. If you want to provide a nice message telling a user what went wrong, schema validation is unlikely to be the right answer. I've often thought that this is something that could be improved upon, so if anyone thinks this is outdated, please speak up.

    Having said that, the reality is that XSD lacks a lot of really useful validations. One that I sorely missed in the past was a one-or-more (really n-or-more) constraint on a sequence. For example, if I have an element with children a, b, c, d, there's no clean way to specify that at least one of the children is required but all are allowed.

    RELAX-NG has a richer validation model, but XSD has become the de facto standard.
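As a sketch of the application-side alternative, the "at least one of a, b, c, d" rule can be checked after parsing, with messages written for a person to read. The parent element name item and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_CHILDREN = ("a", "b", "c", "d")

def check_children(element):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    present = [child.tag for child in element]

    unexpected = [tag for tag in present if tag not in ALLOWED_CHILDREN]
    if unexpected:
        problems.append(
            f"<{element.tag}> contains unexpected children: {', '.join(unexpected)}"
        )

    if not any(tag in ALLOWED_CHILDREN for tag in present):
        problems.append(
            f"<{element.tag}> needs at least one of: {', '.join(ALLOWED_CHILDREN)}"
        )
    return problems

print(check_children(ET.fromstring("<item><b/><d/></item>")))
print(check_children(ET.fromstring("<item/>")))
```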
  8. RTFM for one-or-many, you use . As for RELAX-NG, it seemed less useful than XSD, e.g. useful features are missing and fewer APIs support it. I also think that the DOM API should be scrapped in favor of a proper XML tree API for all languages; e.g. I found XOM (with a built-in subset of XPath) much easier to use for Java. In most cases I prefer to use a tolerant XSD and an XSD code generation API like Castor for most XML, after identifying the root element via a sliding window buffer and regular expressions.
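The root-element sniffing mentioned above might look roughly like the following. This is a simplified sketch, not the poster's implementation: it reads one fixed-size window rather than sliding, assumes an ASCII-compatible encoding such as UTF-8, and would be fooled by a tag-like string inside a leading comment. The name sniff_root and the sample documents are hypothetical.

```python
import io
import re

# First '<' followed by a name character; '<?xml', '<!--' and '<!DOCTYPE'
# are skipped because '?' and '!' cannot start an element name.
ROOT_TAG = re.compile(rb"<([A-Za-z_][\w.\-]*(?::[\w.\-]+)?)")

def sniff_root(stream, window: int = 4096):
    """Return the name of the first element in the first `window` bytes, or None."""
    head = stream.read(window)
    match = ROOT_TAG.search(head)
    return match.group(1).decode("ascii", "replace") if match else None

doc = b'<?xml version="1.0" encoding="UTF-8"?>\n<!-- order feed -->\n<order id="1"><item/></order>'
print(sniff_root(io.BytesIO(doc)))

prefixed = b'<po:order xmlns:po="urn:example:po"/>'
print(sniff_root(io.BytesIO(prefixed)))
```

Once the root name is known, the document can be routed to whichever generated binding handles that type.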
  9. Pull Parsers

    Talking only about SAX and DOM isn't adequate anymore. Pull parsers and the StAX spec are what I'd use for most things now. This spec pretty much delivers on "easy as DOM, fast as SAX" and I'd say it should get the first look over both in most cases.
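StAX itself is a Java API, but the pull style is easy to sketch with Python's standard xml.etree.ElementTree.XMLPullParser, used here as an analogue rather than as StAX. The calling code asks for events when it is ready for them, instead of receiving callbacks. The sample data and the way it is split into chunks are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document arriving in pieces, e.g. from a network socket.
chunks = [
    "<contacts><contact><name>Al",
    "ice</name></contact>",
    "<contact><name>Bob</name></contact></contacts>",
]

parser = ET.XMLPullParser(events=("end",))
names = []

for chunk in chunks:
    parser.feed(chunk)
    # Pull whatever events are complete so far; the caller drives the loop.
    for event, elem in parser.read_events():
        if elem.tag == "name":
            names.append(elem.text)
        elif elem.tag == "contact":
            elem.clear()  # drop finished subtrees to keep memory use flat

parser.close()
print(names)
```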
  10. Re: Pull Parsers

    Aside from that, the author makes some statements about the value of DOM over SAX that I disagree with. If you are building an object graph from the stream, all you need is a stack to keep track of where the next object goes. And the statement that you have to parse the document multiple times with SAX to handle it multiple times is nonsense. You can create a handler that aggregates any number of handlers and forwards the events. There are very few cases where DOM is better than SAX, and the reason it's favored is that it's more aligned with the procedural mindset.
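The aggregating handler described above can be sketched with Python's standard xml.sax module. The document is parsed once and every event is forwarded to each registered handler. The class names and sample data are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class FanOutHandler(ContentHandler):
    """Forwards each SAX event to every handler in the list."""
    def __init__(self, handlers):
        super().__init__()
        self.handlers = list(handlers)

    def startElement(self, name, attrs):
        for handler in self.handlers:
            handler.startElement(name, attrs)

    def endElement(self, name):
        for handler in self.handlers:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.handlers:
            handler.characters(content)

class ElementCounter(ContentHandler):
    """Counts how often each element name occurs."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

class CountryCollector(ContentHandler):
    """Collects the text of every <country> element."""
    def __init__(self):
        super().__init__()
        self.countries = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "country":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "country" and self._buffer is not None:
            self.countries.append("".join(self._buffer))
            self._buffer = None

SAMPLE = (
    b"<contacts>"
    b"<contact><address><country>USA</country></address></contact>"
    b"<contact><address><country>UK</country></address></contact>"
    b"</contacts>"
)

counter = ElementCounter()
collector = CountryCollector()
xml.sax.parseString(SAMPLE, FanOutHandler([counter, collector]))  # single pass

print(counter.counts)
print(collector.countries)
```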
  11. XML Editor - XPontus is

    I've found XPontus to be quite a nice open source XML editor - a handy replacement for XML Spy. You can open it using Java Web Start. See http://xpontus.sourceforge.net/