Sunday, August 8, 2010

On XML and data formats

In many discussions of XML, there seems to be a faction of programmers who are completely dead-set against XML.  They'll insist on JSON, or YAML, or any other cool technology that isn't supported in that 5-year-old version of whatever language your company is still running.  The usual complaints leveled against XML-based formats are verbosity and the complexity of the DOM.  (Sometimes, leading or trailing whitespace on element contents in pretty-printed XML will bite a project, but this never seems to come up in internet flamewars.)

The really young, or maybe just incurably naive, programmers will even chime in that anything that can be done in XML can be done, or even done better, in JSON.  I even thought this once, until I tried to write a generic XML-to-JSON converter, which showed me how wrong I was.

Ultimately, I learned that XML (based on SGML, perhaps with some lessons learned from HTML) is at its heart a document format, and it makes the most sense when used to mark up document text.  XML tags and attributes contain all sorts of useful metadata, and the angle brackets isolate it from the core text so well that something intelligible may still come out if you strip all the tags and examine nothing but the element content.
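
The tag-stripping test is easy to try.  Here is a minimal Python sketch using the standard library's xml.etree.ElementTree; the sample document is made up for illustration and is not from any real project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
doc = '<p>See the <a href="http://example.com/spec">spec</a> for <em>all</em> the details.</p>'

root = ET.fromstring(doc)

# itertext() yields every text fragment in document order,
# skipping the tags and attributes entirely.
plain = "".join(root.itertext())
print(plain)  # See the spec for all the details.
```

The link target and the emphasis are gone, but the sentence still reads as a sentence, which is the sign that XML was a reasonable fit for this content.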

Corollary: if something useful doesn't result after the XML tags are stripped, then XML was not the optimal choice of format.  Things like XML-RPC come to mind for this.  In those cases, XML was chosen for familiarity more than anything else.

When I was attempting to write my converter, I started simple.  I figured out how to handle <tag attribute="value">text content</tag> without key collisions.  In the top-level hash table, "tag" would point to another hash table with a key for each attribute, plus a special key for the element's content that could never collide with an attribute name.  Then, I asked myself, "What happens if I need to encode '<strong>some <em>great</em> markup</strong>'?"

Answer: My JSON structure would be a reimplementation of the DOM tree.  To faithfully represent the XML, I would have to know how many children an element had, what type each child was, and what order they belonged in.  To extract a given node's text content, I'd have to visit all the children of that node and join their text content together, in order, just as an XML library does.  (If the XML API has an innerText call, it's just a convenient way to ask the API to do the exact same task.)
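
Both stages can be sketched in Python.  This is an illustration, not the original converter: the "#text" key, the function names, and the "tag"/"attrs"/"children" layout are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

def naive(elem):
    """First attempt: attributes become keys, plus one special key for content.

    "#text" is an assumed choice; it cannot collide with an attribute,
    because XML names may not begin with '#'.
    """
    entry = dict(elem.attrib)
    entry["#text"] = elem.text or ""
    return {elem.tag: entry}

def to_tree(elem):
    """Second attempt: an ordered list of children, i.e. a small DOM in JSON."""
    children = []
    if elem.text:
        children.append(elem.text)          # text before the first child
    for child in elem:
        children.append(to_tree(child))     # the child element itself
        if child.tail:
            children.append(child.tail)     # text following that child
    return {"tag": elem.tag, "attrs": dict(elem.attrib), "children": children}

def text_content(node):
    """Walk the children in order and join their text, as innerText would."""
    if isinstance(node, str):
        return node
    return "".join(text_content(c) for c in node["children"])

simple = naive(ET.fromstring('<tag attribute="value">text content</tag>'))
mixed = ET.fromstring("<strong>some <em>great</em> markup</strong>")
lossy = naive(mixed)     # silently drops the <em> element and " markup"
tree = to_tree(mixed)

print(json.dumps(simple))
print(json.dumps(lossy))
print(json.dumps(tree, indent=2))
print(text_content(tree))
```

The naive mapping handles the flat element but keeps only "some " from the mixed one.  The version that loses nothing has to carry the child list, the child types (string or object), and their order, which is exactly the bookkeeping the DOM tree already does.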

I was forced to conclude that, if there was no way to convert XML to JSON other than to create a structure that describes every nuance of the XML source, then XML must be strictly more expressive than JSON.  The fact that you can represent XML in JSON is of academic interest only, much as you can write any program with a few fundamental operations, yet a high-level language with a broad standard library is generally more productive.

Once the difference between XML and JSON is understood, it becomes much simpler to assess a problem and determine the correct approach.  Are you sending data structures, such as function names and parameter lists, back and forth?  JSON-RPC is the more natural choice.  Do you need to mark portions of a text with links, formatting, attribution, or other information that spans arbitrary ranges of the text?  XML will probably be easier to handle.

Occasionally, the best solution may be passing a JSON data structure, with one or more values containing XML as string data, to be parsed separately from the JSON at the destination.  Sometimes the cry for "Only one format!" for simplicity's sake actually makes the single-format solution more complex overall.  Maybe there's a cost to loading both XML and JSON libraries, but in general, costs should be cut only when they have been proven to be excessive; otherwise, it is too easy to be bit wise and word foolish.
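
A minimal sketch of that mixed approach in Python follows.  The message shape, the "post_comment" method name, and the field names are hypothetical, chosen only to show the two parsers working side by side.

```python
import json
import xml.etree.ElementTree as ET

# Sender: plain data travels as JSON; the marked-up text rides along as a string.
# The method and field names are invented for this example.
message = {
    "method": "post_comment",
    "params": {
        "author": "alice",
        "body": "<p>This is <em>really</em> useful.</p>",
    },
}
wire = json.dumps(message)

# Receiver: parse the JSON envelope first, then parse the XML value separately.
received = json.loads(wire)
body = ET.fromstring(received["params"]["body"])

emphasized = [e.text for e in body.iter("em")]
plain = "".join(body.itertext())

print(received["params"]["author"], emphasized, plain)
```

Each library handles the part it is suited to: the JSON parser never sees angle brackets as structure, and the XML parser never has to model a parameter list.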

Author's Note: The term "DOM tree" in this post refers specifically to the DOM as a conceptual representation of an XML structure, not the Java-oriented API of the same name.  The DOM API is a terrible way to work with the DOM tree in any other language, and the author's mild fondness for the tree should not, in any circumstance, be mistaken for condoning the API.

1 comment:

rpbouman said...

This post is now a year old, so I'm late to the party, but this is by far the most elegant explanation I ever read concerning the distinction between a document format and a data format.