Geeky nitpicking herein
Mar. 26th, 2008 10:41 pm
As I sit here with the pangs of a mysteriously aching shoulder slowly receding before happy non-steroidal chemicals, I contemplate another pain in my neck: XML.
This is supposed to be a simple, lowest-common-denominator, easy-to-work-with representation for all kinds of data. So why is using it in any serious way so much of a pain? I leave aside, for the moment, that it's really only suited by nature to tree-shaped data*, as that's its designed purpose, and that does indeed suffice for quite a wide array of practical data, if by no means amounting to the implied universal storage mechanism. And yes, it's bulky, but it is compressible, and it's supposed to trade off size for simplicity. My real problem is that the XML handlers I've worked with, in multiple languages, are just terrible to work with.
I have been doing most of my recent XML work in C# and Java, which share generally the same paradigm, and share a general badness. Much of it is simply a mismatch with the language: a list of child nodes is a NodeList object, which is not a collection, and the methods for getting Elements return Nodes, which must be manually cast to Elements. Traversing a long branch of a tree is either an enormous Russian-doll project or an occasion to invoke a special XPathNavigator. Some decisions are simply bizarre: Java's core XML implementation (which is spread across multiple top-level packages, namely javax and org) offers no obvious, direct way to write out the contents of a Document, leaving that largely as an exercise for third parties.
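To make the mismatch concrete, here is a minimal Java sketch of the sort of glue code this tends to require. The class name DomGlue and its methods are hypothetical, made up for illustration. The one way I know of to get text out of a Document while staying inside the JDK is to run an identity transform through javax.xml.transform, which is roundabout but does work; treat this as a sketch under that assumption, not a recommended design.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and methods are illustrative only.
final class DomGlue {
    /** Copies the element children of a node into an ordinary List. */
    static List<Element> childElements(Node parent) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            // Skip text, comments, etc.; the cast is still manual.
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                out.add((Element) n);
            }
        }
        return out;
    }

    /** Parses a string into a DOM Document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a node to text by way of an identity transform. */
    static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b/>text<c/></a>");
        for (Element e : childElements(doc.getDocumentElement())) {
            System.out.println(e.getTagName());
        }
        System.out.println(serialize(doc));
    }
}
```

Note that fifteen imports from three different package roots are needed just to read a string, list some children, and write the result back out.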
Other problems are more basic. Tree manipulations are naturally done node by node, yet I cannot manipulate nodes freely without the involvement of a parent document. It is natural to move through XML both as a tokenized stream and as a tree, but the tools for doing so are far from smoothly interoperable.
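Here is a small Java sketch of the parent-document problem, using only the standard org.w3c.dom interfaces. The class OwnerDocumentDemo and its method names are hypothetical. The assumption, which matches the DOM implementations I have used, is that a node created by one Document is refused by another unless it is first copied across with importNode.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; names are illustrative only.
final class OwnerDocumentDemo {
    /** Builds an empty document with a single root element of the given name. */
    static Document newDoc(String rootName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement(rootName));
        return doc;
    }

    /** Tries to append a node as-is; returns false if the DOM rejects it. */
    static boolean tryDirectAppend(Document target, Node node) {
        try {
            target.getDocumentElement().appendChild(node);
            return true;
        } catch (DOMException e) {
            // A node owned by another document is normally refused (WRONG_DOCUMENT_ERR).
            return false;
        }
    }

    /** Deep-copies a foreign node into the target document and attaches the copy. */
    static Node graft(Document target, Node foreign) {
        Node copy = target.importNode(foreign, true);
        return target.getDocumentElement().appendChild(copy);
    }

    public static void main(String[] args) throws Exception {
        Document source = newDoc("source");
        Document target = newDoc("target");
        // Even an unattached element has to be created by, and stays owned by, a Document.
        Element leaf = source.createElement("leaf");
        System.out.println("direct append accepted: " + tryDirectAppend(target, leaf));
        Node grafted = graft(target, leaf);
        System.out.println("grafted owner is target: " + (grafted.getOwnerDocument() == target));
    }
}
```

The point is less the extra call than the fact that there is no such thing as a free-standing node: every fragment carries a document with it.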
So, having worked with a few libraries and looked over a few others, am I just missing the existence of genuinely good methods for manipulating this stuff? I have found several times that writing my own utilities to do anything interesting saves a lot of swearing, but do I really need to implement a whole handling system to get something usable? Here are the things that I would like to be easy in an XML system,** some of which I would be fine implementing myself on top of a basically functional system, and some of which I have in fact already implemented:
- Tree manipulations
  - Creating and compositing nodes freely
  - Traversing in one step down a known path to a given node
  - Creating an arbitrary leaf with all its necessary intermediate nodes from the relative root
  - Getting the intersection of two trees
  - Getting the union of two trees
- Raw data manipulations
  - Free conversion between strings and trees
  - Stepping forward and backward through tokens
  - Working with any chosen subset of elements, text/PCDATA, and comments, and ignoring the others
  - Accessing elements by properties (name, attributes, etc.) regardless of tree position
- Structured data manipulations
  - Validating a document against a DTD with informative notification of differences
  - Validating an element and its sub-tree against a DTD
  - Direct transformational mapping between a source DTD and a result DTD***
  - Implicit handling of the Document Element, e.g. wrapping multiple root nodes under one automatic "root" to meet XML specs****
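As an illustration of two of the tree items above (one-step traversal down a known path, and creating a leaf along with its intermediate nodes), here is a minimal Java sketch layered on the stock DOM. The class PathHelper, its method names, and the slash-separated path syntax are all assumptions of mine; it ignores namespaces, attributes, and repeated siblings, and javax.xml.xpath could handle the read-only half.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; names and path syntax are illustrative only.
final class PathHelper {
    /** Follows a slash-separated path of child element names; null if a step is missing. */
    static Element descend(Element root, String path) {
        Element current = root;
        for (String name : path.split("/")) {
            current = firstChild(current, name);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Like descend, but creates any missing elements along the way. */
    static Element ensurePath(Element root, String path) {
        Element current = root;
        for (String name : path.split("/")) {
            Element next = firstChild(current, name);
            if (next == null) {
                next = current.getOwnerDocument().createElement(name);
                current.appendChild(next);
            }
            current = next;
        }
        return current;
    }

    /** Returns the first child element with the given name, or null. */
    private static Element firstChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("config");
        doc.appendChild(root);
        ensurePath(root, "server/network/port").setTextContent("8080");
        System.out.println(descend(root, "server/network/port").getTextContent());
    }
}
```

Calling ensurePath a second time with the same path returns the existing leaf and creates nothing new, which is the behaviour I would want from a built-in version.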
So... does anyone know of any systems that provide all, or even most, of these features through a consistent mechanism, preferably in keeping with the mechanisms of the surrounding language? If not, what other features should I add to the list for when I write my own system?
*I do wonder, though, whether there's a standard pointer implementation to put on top of your XML, to build arbitrary graphs. You can create ID attributes and invoke them in reference nodes later, but is there a well-known standard for how that's encoded and read back?
**Ok, so I'd really like it to have some more flexible SGML options as well, at least case-insensitivity and value-free attributes, but I recognize that this may be a vain hope.
***XSLT does this, but it's a hugely more cumbersome system than a tree-to-tree mapping need be.
****Writing and reading files in this mode is a special bonus that we kind of hacked into the logs at Convoq, because we wanted XML logs, but you can't continuously write out data to the end of a log file and have it be valid XML, nor do a tail -f equivalent. TK and I were singularly underwhelmed by the single-rooted document dictum.
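For the reading half of that log hack, one approach is to splice a synthetic root around the stream before the parser sees it. This is a sketch of the general idea in Java, not the code we actually used. The class RootlessLogReader and the wrapper element name "log" are hypothetical, and it assumes the file is UTF-8, carries no XML declaration or DOCTYPE, and ends on a complete entry.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical reader class; the wrapper element name "log" is an assumption.
final class RootlessLogReader {
    /** Parses a stream of sibling elements by wrapping it in one synthetic root. */
    static Document parseFragments(InputStream fragments) throws Exception {
        InputStream open = new ByteArrayInputStream("<log>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</log>".getBytes(StandardCharsets.UTF_8));
        // Concatenate: opening tag, the rootless content, closing tag.
        InputStream wrapped = new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, fragments, close)));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
    }

    public static void main(String[] args) throws Exception {
        String log = "<entry id=\"1\"/>\n<entry id=\"2\"/>\n";
        Document doc = parseFragments(
                new ByteArrayInputStream(log.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getElementsByTagName("entry").getLength());
    }
}
```

The writer can then append entries forever without ever emitting a closing root tag, and a tail-style reader only has to wait for the next complete entry.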