Where Should XML Go?

Posted 2 Feb 2005 at 11:29 UTC by Ankh

XML has been a W3C Recommendation for over five years. Last year we published a relatively small update, XML 1.1, which has not yet seen wide adoption. People are asking us for more efficient ways to process XML, for processing model and data model definitions, and other changes. We've nearly finished XML Query. What's next?

The World Wide Web Consortium published the Extensible Markup Language (XML) as a Recommendation in 1998. We envisioned use cases primarily in technical documentation, although a number of academic text-based projects were also very significant. Motivation for XML had come from a number of sources:

  • The late Yuri Rubinsky, then president of SoftQuad Inc, a visionary in the field of structured (semantic) markup, and also a champion for assistive technologies, had given a number of talks about the importance of sharing meaning;
  • C. Michael Sperberg-McQueen gave a talk at SGML '95 about SGML (the Standard Generalized Markup Language) as infrastructure; he suggested that we needed marked up information to become part of the invisible infrastructure of computing, much like the many service tunnels under the city of Chicago;
  • SoftQuad Inc. had shipped (in 1994) a Netscape browser plugin that displayed SGML documents, downloading additional definitions and components from the remote Web server if necessary, and also providing a way for people to share sets of annotations and superimposed links.

    That plugin, Panorama, was generating a lot of interest and attention in the technical documentation world as people started to understand that the World Wide Web could be used as an instant delivery mechanism.

  • Jon Bosak was at Novell, where he was responsible for managing over a hundred thousand pages of documentation. He was an early adopter of Panorama, but was worried about committing to a proprietary technology: although SGML was a published standard, and all of Novell's documentation was already in SGML, the way that Panorama supported only a subset of SGML, and the way it was deployed on the Web, were not standard.

    Jon searched for a group to standardise "Web SGML", but unfortunately Yuri had just died, and it seemed that there wasn't anyone else who could persuade the ISO SGML Committee to look at this problem: perhaps they were still busy hoping the Web would go away.

  • The W3C had published the HTML specification as a Recommendation, and had a number of people (including Dan Connolly) who were familiar with SGML and the tools around it; Jon took his problem to the W3C and a new Working Group was born.

So you can see that there's a history of writing, of documentation; if I were to introduce more of the people involved in the early days of XML, and more of the projects, you'd see this even more strongly.

Three main specifications were envisioned, with a fourth following close on their heels: a way to style XML (XSL and XSLT, like SGML's DSSSL); a way to link within and between XML documents (XLink, a tiny subset of SGML HyTime); a way to search and query XML (XQuery, another tiny subset of SGML HyTime); and also a way to constrain the structure and content of XML documents so you could tell if a document conformed to a predefined set of expectations (XML Schema, like an SGML DTD).

W3C has since published XSL, XSLT, XLink and Schema, and is working on a suite of specifications related to XQuery: XPath 2, XSLT 2, XML Schema 1.1 and of course XML Query 1.0. There's also an XSL-FO 1.1 in the works (XSL-FO is the formatting part of XSL, as opposed to XSLT, the transformation part).

The big question is this: what should we do next?

Technical documentation was once a primary use case for the World Wide Web, but that's no longer true. It's no longer primary for XML either, although it's still very important. Instead, technical documentation is now one of a great many uses of XML. A Web service to provide a current stock price has, at first glance, very little in common with a 150,000-page aircraft repair manual.

As the uses of XML have spread, limitations and weaknesses have become apparent. Many of those limitations apply to technical documents even though it took widely differing applications to give us the perspective needed to understand them.

  • The natural verbosity of XML is excellent for robustness. This is essential for situations where the "correctness" of structured data cannot be automatically verified: a mismatched end-tag must be repaired by hand, because a computer can't generally tell if author or title was intended.

    With verboseness, however, comes higher bandwidth usage and also greater time to read, write and process.

  • XML is a textual format, and is designed to be processed from start to end. This makes it difficult or impossible to start in the middle (for example, with a continuous news feed) or to jump directly to the Nth page of a large book.

  • XML has a number of features and constraints on those features that can be a pain to implement but that are rarely used. Some of them, like Notations, are not a good fit with the World Wide Web, and echo XML's SGML pre-Internet heritage. Others, like parameter entities, can be difficult to understand and use, and have an arcane syntax.

    There's a cost to such features. XML is already substantially simpler than SGML, but it could be even simpler.
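
The robustness point in the first bullet above can be seen with any conforming parser. What follows is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the sample record and the helper name check are made up for illustration. The parser notices the mismatched end-tag, but it has no way of knowing which of the two names the writer meant:

```python
import xml.etree.ElementTree as ET

# A hypothetical record whose <author> element is closed with the wrong tag.
BROKEN = "<book><author>A. N. Other</title></book>"
REPAIRED = "<book><author>A. N. Other</author></book>"

def check(document):
    """Return None if the document is well-formed, else the parser's complaint."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return str(err)
    return None

error = check(BROKEN)      # the parser reports the mismatch and stops
print(error)
print(check(REPAIRED))     # None once a human has decided what was intended
```

The redundancy of the end-tag is what makes the error detectable at all; a terser syntax would silently accept the wrong structure.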

There are many other such items. The W3C has a Working Group currently devoted to working out whether a more efficient way to transfer XML between applications or systems should be published at W3C. This is sometimes called binary XML although that's a misleading name for a number of reasons, and I personally prefer efficient interchange.

If W3C does publish a new way to interchange XML, we risk damaging the story that every processor can understand every document. Strictly speaking this is already a fiction because of encodings, and because of XML 1.1, so perhaps this isn't such a big deal as it might sound.

One way to introduce an efficient interchange format might be to publish an XML 2.0 with two separate syntaxes: the human-editable textual format and the more efficient and probably binary format. But if we do that we've changed XML. No XML processor today can handle an XML 2.0 document, since there isn't such a thing.

Should we change XML?

XML doesn't moo at taxis: it's not sacred. It's a spec that we should keep around as long as it's useful. But if we change it we have to wonder what other changes we should make. To determine that, we have to ask people who are using XML today what changes they would like to see, and also ask people not using XML today exactly why that's so.

So here are my questions for the Open Source and Free Software community:

  1. People writing software and representing structured information (whether it's a configuration file or documentation or data) - if you're not using XML, what's stopping you?

  2. People using XML: what are the edge cases, the limits, the places where you've tried to push XML and failed?

  3. What (if anything) should we change?

Finally, I should note that I'm not trying to push XML as a single solution for all problems. Rather, I want to discover places where it's almost a solution: places where you think it's the right answer but you can't use it for some reason. Or reasons not to make changes, of course.

XML for configurations? No, not that, posted 2 Feb 2005 at 18:53 UTC by gwolf » (Journeyer)

XML is a great idea for data interchange, for RPC, and even for _some_ sorts of configurations - specifically, for computer-manipulated configuration. XML can be human-parsed, but it is not meant for that. Configuration files are something a human can spend a long time in, and that a program parses only occasionally - at startup, or every time it detects a change in it. It is much more convenient to use human-friendly formats for configuration files. If you have hand-configured things such as JBoss (why does the Java crowd like XML so much? :) ) you will understand.
Now, XML isn't all that friendly or easy for the computer itself either - I'd surely go for YAML any time, as it is at least as simple as XML to process and _much_ easier for a human to grok. But even there, I would not push for YAML as a universal configuration format - yes, almost any configuration can be represented in XML or YAML, but very often you don't want that overhead.

Re: XML for configurations? No, not that, posted 2 Feb 2005 at 20:37 UTC by Ankh » (Master)

Thanks for taking the time to reply!

The biggest value of XML is its ubiquity. It's everywhere.

Personally I hate and despise complex config file formats like the awful bind stuff with curly braces and some idiosyncratic syntax you learn once, use, and move on with your life. Sendmail was worse of course, and ircd is a pain too. There are so many config files on Unix with incompatible syntaxes simply because each programmer has the hubris to think that people will care enough about that package to learn some syntax or other. I've done it myself in the past too.

Whether XML is the right answer or not, go write some UI tools that will parse fstab, /etc/group, ttytab, inittab, /var/db/named/*, apache.conf, TeX font config files, GhostScript config files, X11 fonts.dir... and give a user interface to editing them and checking they're plausible before overwriting the originals. It's not easy.

So I'm with Jim Gettys here, let's drain the swamp. XML might not be perfect for this, but it's better than today's nonsensical mess.


XML for configuration and data, posted 3 Feb 2005 at 04:14 UTC by tk » (Observer)

I'm sorry to say I'm with gwolf here. Writing code to validate and extract data from /etc/group may not be trivial, but I feel that writing code to extract data from XML is just an exercise in pain. And that was using Perl's XML::Parser. (!!!)

I don't see how /etc/sendmail.cf will suddenly become easier to write if everyone switches to XML. And is the /etc/group format really that hard to learn?

In any case, since configuration files are meant to be tweaked by users, it makes more sense to try to save the user's time, even if at the expense of the coder's time in coding a parser. And in any case, XML isn't that easy to parse either (or it's just me).

re: XML for configuration and data, posted 3 Feb 2005 at 06:22 UTC by jamesh » (Master)

tk: While writing some code that can parse and generate /etc/group files might be easier than writing an XML parser, writing code to do the same for every file in /etc is a lot more difficult.

If all that information was represented in one format, and you had a parser that could read that format and a validator that could check particular document instances conform to a particular schema, it would be quite easy to read, modify and write back any of those documents reliably without introducing syntax errors. For many configuration files, a tree oriented format like XML fits the bill.

As for the difficulty of using an XML parser, that has a lot to do with the particular software you use. Some parsers make it very easy to extract data from documents (eg. letting you perform XPath queries on the document).
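
To make the read, modify and write-back cycle concrete, here is a small sketch in Python using the standard xml.etree.ElementTree module, which supports a limited subset of XPath. The XML rendering of /etc/group shown here is invented for the example; no such format is proposed anywhere in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of two /etc/group entries.
GROUPS = """<groups>
  <group name="wheel" gid="0"><member>root</member></group>
  <group name="staff" gid="50"><member>ankh</member><member>tk</member></group>
</groups>"""

root = ET.fromstring(GROUPS)

# Read: one XPath-style query instead of splitting lines on ':' and ','.
staff = [m.text for m in root.findall("./group[@name='staff']/member")]

# Modify: add a member to wheel through the tree API.
wheel = root.find("./group[@name='wheel']")
ET.SubElement(wheel, "member").text = "jamesh"

# Write back: the serialiser takes care of nesting and escaping.
updated = ET.tostring(root, encoding="unicode")
print(staff)
print(updated)
```

The same three steps would work unchanged for any other file that used the same generic syntax, which is the argument being made; checking the result against a schema would need a validating library that the standard module does not provide.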

Get rid of features that nobody uses, posted 3 Feb 2005 at 11:43 UTC by tjansen » (Journeyer)

A year ago I came up with a list of 10 things I hate about XML. It's mostly about removing rarely used features, and I think that these points are still valid.

Re: Get rid of features that nobody uses, posted 3 Feb 2005 at 14:01 UTC by Ankh » (Master)

Thank you for pointing that out.

I think what you'd be suggesting is creating a profile or subset of XML -- I'll call it "XML Core" -- that has fewer features and hence a more regular syntax.

We're close to being able to remove DTDs: the xml:id specification takes away one more use case. General entities, e.g. &companyname;, can in many cases be replaced by XInclude, but not all cases.
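
As an illustration of the XInclude replacement, the sketch below uses Python's standard xml.etree.ElementInclude module. The href companyname.txt, the FRAGMENTS table and the loader function are all assumptions made for the example; a real deployment would fetch the included text from a file or URL.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical shared fragments, keyed by the href that would name them.
FRAGMENTS = {"companyname.txt": "Acme Widgets"}

def loader(href, parse, encoding=None):
    """Stand-in loader: serve includes from FRAGMENTS instead of the file system."""
    if parse != "text":
        raise ValueError("this sketch only handles parse='text'")
    return FRAGMENTS[href]

# Where a DTD-based document would have written: Welcome to &companyname;!
DOC = ('<p xmlns:xi="http://www.w3.org/2001/XInclude">Welcome to '
       '<xi:include href="companyname.txt" parse="text"/>!</p>')

para = ET.fromstring(DOC)
ElementInclude.include(para, loader=loader)   # expands the include in place
print(para.text)
```

One of the cases this does not cover is attribute values: an entity reference can appear inside an attribute, whereas an include element cannot.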

We've used processing instructions to associate an XSLT transformation with an XML document; I personally think this was a poor design, and we need a more general way of associating resources, perhaps something like RDDL but on a per-document basis.

CDATA sections are one of those features that gets into specs because they are useful to the specification authors :-) We should have either had two types of text, with distinguished element names or something, or not had CDATA, I agree. The resulting information items are identical, though: there should be no expectation that a CDATA section survives round-tripping.
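
A quick way to see that the information items are identical, sketched with Python's standard xml.etree.ElementTree (the sample content is made up):

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b < c)]]></code>")
escaped = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &lt; c)</code>")

# Both spellings deliver exactly the same character data to the application.
same_text = with_cdata.text == escaped.text

# Serialising again produces escaped text; the CDATA section is not preserved.
round_tripped = ET.tostring(with_cdata, encoding="unicode")
print(same_text)
print(round_tripped)
```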

UTF-8 doesn't fly too well in China or Japan, so UTF-16 is also needed. For the foreseeable future people will need other encodings, for example because of missing characters in Unicode.

An element might contain only elements and whitespace but that does not mean that you can ignore the whitespace.

<p><a href="xxx">consider</a> <em>this p element</em></p>

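The effect of dropping that whitespace can be checked directly. This is a small sketch using Python's standard xml.etree.ElementTree, which keeps the text following an element in that element's tail attribute:

```python
import xml.etree.ElementTree as ET

para = ET.fromstring('<p><a href="xxx">consider</a> <em>this p element</em></p>')

# The p element has only element children plus one space between them.
link = para.find("a")
separator = link.tail
rendered = "".join(para.itertext())

# What a reader would see if a processor had discarded that space.
link.tail = None
collapsed = "".join(para.itertext())

print(repr(separator))
print(rendered)
print(collapsed)
```

Only a schema or DTD can say whether the space matters, which is why the parser has to pass it along.
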
I agree with you that namespaces should be in the basic specification. The reason they aren't is largely lack of resources: the cost of making the change would be higher than the benefit. But if we have to make a big change to the spec, it might be a good time to do that, along with xml:base and friends.

xml:lang is actually very useful. The XML processor needs to deliver all of the text (consider writing the identity transform in XSLT), but other specifications do make good use of the attribute.

I've met quite a lot of people who prefer XML Schema to Relax NG; it turns out to depend on what you're doing, and your environment, and your background. For my part W3C XML Schema is uncomfortably large and complex, but it also has a richer typing system. At any rate, it's also the basis for XSLT 2 and for XML Query, and is widely implemented and used. So it's harder to see what to do here.

Anyway, many thanks for your input. What advantage for you would there be in a stripped down "XML Core" specification? For example, how does making the parser smaller help you in practice? That's the sort of information that enables me to make arguments in favour of change :-)



Xml Core Parser, posted 3 Feb 2005 at 16:37 UTC by tjansen » (Journeyer)

The complexity of the parser is just a part of the problem. Because 'XML' is so much more complex than 'XML Core', it takes much longer before you can claim that you actually know and understand XML. I think you can learn elements, attributes, namespaces and text within two hours. But understanding full XML, with all those nasty details like parameter entity references, takes much longer.

Right now it may seem reasonable to keep the complex features in order to stay backward-compatible with XML 1.0 and SGML. The problem is that it reduces the lifespan of XML. In 10 years, when SGML is finally forgotten and everybody uses XML Schema or Relax NG or whatever alternatives will be available then, it will make XML look like a dinosaur - full of kludges for outdated systems that the next generation of developers will not understand anymore. Eventually XML will share the fate of ASN.1. In the long term only simple designs can survive.

But today the actual problem is not the complexity of the parser, it's the complexity of the parser's API. The XML spec requires XML parsers to return all those rarely used features: pull parsers return things like processing instructions; a DOM tree uses different classes for CData sections than for regular Text. Not even the XPath data model is free of processing instructions and comments in the document tree.

When you write a program for such an API, you must be prepared for unexpected nodes; you must be prepared that you can get a processing instruction at any time, even if your schema does not mention it and you expect an element; you must be prepared that an element that is supposed to contain a number may contain 4 child elements (Text+CData+Text+Comment). In reality, many applications are not and will fail when you give them complicated (but valid) XML documents. APIs that expose you to XML's full feature set make XML processing more error prone than the processing of simpler data models would be.
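
The Text+CData+Text+Comment case can be reproduced with a DOM-style API. The following is a sketch only, using Python's standard xml.dom.minidom; the price element and the text_of helper are invented for the example.

```python
from xml.dom import minidom, Node

# Hypothetical element that "is supposed to contain a number".
DOC = "<price>4<![CDATA[2]]>.5<!-- approximately --></price>"

price = minidom.parseString(DOC).documentElement
kinds = [child.nodeType for child in price.childNodes]

def text_of(element):
    """Join Text and CDATA children; skip comments and processing instructions."""
    wanted = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(c.data for c in element.childNodes if c.nodeType in wanted)

naive = price.firstChild.data    # what code that expects a single text node reads
value = float(text_of(price))    # what the document actually says

print(kinds)
print(naive, value)
```

Code that reads only the first child gets a plausible-looking but wrong number, with no error raised anywhere, which is the failure mode described above.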

It's possible to write a 'XML Core API' that is able to parse regular XML documents. There are two ways to do this: either just omit everything that is not included in XML Core, or translate these features into regular Elements in a special namespace.

That's why I have given up on using today's XML APIs directly (for most purposes). It is so incredibly complex to use them reliably. I use XML quite a lot in different programming environments, and the first thing I always do is write pull-parsing wrappers that essentially reduce XML to the 'XML Core' subset. With them, reading XML is like deserialization in Java, just that you need to specify an element name for each piece of data.

To parse an XML snippet like

	<title>FooBar for Dummies</title>

I merely need to write

// XmlReader is my own wrapper, not a standard API
XmlReader xr = someWayToGetIt();
String title = xr.getString("title");
// loop once for each optional <author> element that follows
while (xr.enterOptionalElement("author")) {
        String firstName = xr.getString("firstName");
        String lastName = xr.getString("lastName");
}

IMHO this style is the only acceptable way of parsing data-oriented XML, besides Javascript's E4X extension and heavy usage of XPath. The rest is just a huge mess. I have seen too many programs that attempt to use DOM or SAX for parsing data-oriented XML, and they all have either bloated or buggy XML reading code. DOM and SAX may be fine for mark-up style XML documents like HTML and for 'general purpose' XML applications that are not restricted to a schema, but not for reading data.
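
For readers who want to try the idea, here is one possible shape for such a wrapper, sketched in Python on top of the standard xml.etree.ElementTree module. It is not the wrapper described above: the class name CoreReader, its method names and the sample book document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

class CoreReader:
    """Hypothetical forward-only cursor over the child elements of one parent."""

    def __init__(self, parent):
        # ElementTree's default parser already discards comments and
        # processing instructions and folds CDATA sections into plain text.
        self._children = list(parent)
        self._pos = 0

    def _peek(self, name):
        """Return the next child if it is named `name`, else None."""
        if self._pos < len(self._children):
            child = self._children[self._pos]
            if child.tag == name:
                return child
        return None

    def get_string(self, name):
        """Consume a required child element and return its text content."""
        child = self._peek(name)
        if child is None:
            raise ValueError(f"expected <{name}>")
        self._pos += 1
        return "".join(child.itertext()).strip()

    def enter_optional(self, name):
        """Consume an optional child and return a reader over its children."""
        child = self._peek(name)
        if child is None:
            return None
        self._pos += 1
        return CoreReader(child)


BOOK = ET.fromstring("""<book>
  <title>FooBar for <![CDATA[Dummies]]></title>
  <author><firstName>Ann</firstName><lastName>Example</lastName></author>
  <author><firstName>Bob</firstName><lastName>Sample</lastName></author>
</book>""")

reader = CoreReader(BOOK)
title = reader.get_string("title")
authors = []
while (author := reader.enter_optional("author")) is not None:
    authors.append((author.get_string("firstName"), author.get_string("lastName")))

print(title, authors)
```

The calling code never sees node types at all, only named pieces of data, and an unexpected element fails loudly instead of being silently misread.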

"An element might contain only elements and whitespace but that does not mean that you can ignore the whitespace."

Your example is certainly valid for mark-up texts, but I don't think that it should be the default (thus without xml:space="preserve"). Maybe that's because I am using XML mostly for data/SOAP, and skipping whitespace is one of the most annoying things when you parse data.

"I agree with you that namespaces should be in the basic specification. The reason they aren't is largely lack of resources: the cost of making the change would be higher than the benefit."

The benefit would be that the use of the colon in elements would be defined. Colons in the elements of non-namespace-aware documents can result in funny effects when encountered by namespace-aware parsers, unless they have two modes of operation.
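
The two modes are easy to demonstrate with expat, which Python exposes as xml.parsers.expat. In this sketch the sample document and the accepts helper are made up; the only library behaviour relied on is that passing namespace_separator to ParserCreate switches namespace processing on.

```python
from xml.parsers import expat

# Element names with a colon but no namespace declaration (hypothetical).
LEGACY = "<db:table><db:row/></db:table>"

def accepts(document, namespace_aware):
    """Report whether expat parses `document` in the given mode."""
    if namespace_aware:
        parser = expat.ParserCreate(namespace_separator=" ")
    else:
        parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False
    return True

plain_mode = accepts(LEGACY, namespace_aware=False)      # colon is just a name character
namespace_mode = accepts(LEGACY, namespace_aware=True)   # db is an unbound prefix
print(plain_mode, namespace_mode)
```

The same bytes are a fine XML 1.0 document in one mode and an error in the other.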

"xml:lang is actually very useful. The XML processor needs to deliver all of the text (consider writing the identity transform in XSLT), but other specifications do make good use of the attribute."

But why does it get the privilege of being in the XML spec and using the 'xml' prefix, even if it is not interpreted by the XML parser? Other specifications like XLink are comparable to xml:lang, but do not use the 'xml' prefix.

Re: Xml Core Parser, posted 6 Feb 2005 at 04:50 UTC by Ankh » (Master)

It's not clear to me that it's worth changing XML to favour data-over-SOAP over document-oriented XML; if we'd anticipated XML-RPC we might have tried for a more neutral design.

I agree about non-namespace-aware processors.

xml:lang is in the spec because language is universal; XML is about communicating structured information in a human-understandable manner, and that means that natural language is involved. The XML specification doesn't actually make any use of text in element content either, but it's part of XML for a similar reason :-)


DOM and SAX, posted 6 Feb 2005 at 15:56 UTC by jef » (Master)

Yeah, the parser APIs were an issue for me too, so I made a new one. You can read about it here: http://www.acme.com/software/XIP/

Re: Xml Core Parser, posted 7 Feb 2005 at 00:56 UTC by tjansen » (Journeyer)

Actually it's not only SOAP and XML-RPC. The aforementioned configuration files, WebDAV's XML, RSS and Atom are other examples of data-oriented schemas that have the same problem. XML's complexity makes parsing more difficult than necessary for most schemas, except for those that rely heavily on mixed content (like HTML) or keep most text in attributes (like SVG).

And I think solving that problem is a worthwhile goal, because the world could really use a data model that is simple to use for different purposes. XML 1.x is definitely closer to that goal than YAML, which is too focussed on a single kind of data (and IMHO too complex; it has too many special cases and characters with special meaning). But every time you mention XML you hear developers groan under the complexity of parsing it, directly after the complaints about readability and typability.

(BTW I think that the latter problems can be solved as well, but only by introducing some syntax on top of XML which is optimized for the special-case of XML without mixed content)

Mixed Content, posted 7 Feb 2005 at 16:43 UTC by Ankh » (Master)

During the development of XML I (and probably many others) suggested using differentiated syntax for elements that could contain text, as opposed to element-only elements.

It's rare that one wants #PCDATA only (e.g. consider that some people have names that need markup, or that there are book titles with fragments of mathematics in them), although one might often think it's what one wants.

It's sort of like forms that want a telephone number, require exactly six digits (or whatever is the norm where the form writer lived) and don't allow text like "when the receptionist answers, ask her to page 306", or, "choose 6 from the menu and then I'm extension 9081" that's often needed. Or the online booking system that I could not use to book an airline ticket, because the programmer decided that Canadian postal codes consisted of five digits and no letters.

It's clear to me that there are some impedance mismatches with APIs, since a document-oriented API might not be so convenient for data, and one intended for RDBMS tuples or CSV files might not work at all for documents.

Most business data is in the form of reports, letters, memos, PostIt[tm] notes, telephone messages, and a host of other documents. That relational database is usually a small fraction of the information moving about, and the reason has been the difficulty of handling the other information. Optimising our representation so that it's good for the corner case that happens to be in high demand right now, but at the expense of making the common case even harder, might not be wise in the long term.

So I think that perhaps some work is needed on making APIs more schema-aware, but also that perhaps we should revisit the idea of distinguishing element-content-only nodes.

Whitespace turns out to be one of the hardest and most controversial parts of handling text markup. It's impossible to meet everyone's needs, but I agree that it's worth taking time to make sure that most people's needs are met, where we can do so.

Thanks for the comments!


My Suggestions, posted 14 Feb 2005 at 22:42 UTC by johnnyb » (Journeyer)

1) Get rid of some of the extra standards. XML Schemas should go bye-bye. They are trying to shoehorn XML into something that it isn't good at doing.

2) Re-emphasize processing instructions. IIRC, the XML standard specifically de-emphasizes them, which I think is a STUPID, STUPID idea. Processing instructions allow you to push the same document down multiple pipelines without having to modify the DTD. In fact, I think that's how <br/> should be implemented, as a processing instruction for display-oriented processors: <?xhtml:display-line-break?>.

3) Rethink XML namespaces. I think they're a good idea, but they need to be re-thought in order to make them easier to program with, write, and use. Specifically, namespaces should be the EXCEPTION, rather than the rule. By default you should just be coding a document according to a DTD. Probably the best place for namespaces is in the DTD customization section.

Re: My Suggestions, posted 17 Feb 2005 at 04:42 UTC by Ankh » (Master)

Schemas help people to validate, i.e. to test whether data meets constraints. They are very widely used, and (in various forms) have been used since the early days of SGML itself. XML Schema might not be the epitome of elegance, but it's definitely not trying to shoehorn XML into something it does not do well - transmitting structured information.

XHTML 2 chooses a different way to mark line breaks. There are multiple reasons for marking a line break, and a common one is semantic - e.g. consider the separate lines in a poem.

Processing instructions are, by nature, processor-specific, rather like pragmas in ANSI (and ISO) C. Use them when people can't agree on a better specification, with the understanding that they are fragile, unconstrained by schemas (including DTDs), and have very poor interoperability.

It's true that there are cases where a line-break could usefully be represented with a processing instruction. An example might be if you are using a specific formatter and want to correct a line breaking problem in a particular paragraph -
<?troff .br?>
for instance. If you reformatted the document with TeX (say), the line-breaks would probably be different, and you'd want that processing instruction to be ignored.
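
A sketch of how a formatter might pick out only the instructions addressed to it, using Python's standard xml.dom.minidom. The sample paragraph and the function name instructions_for are invented; the target and data attributes are those the DOM defines for processing instruction nodes.

```python
from xml.dom import minidom, Node

# Hypothetical paragraph carrying a formatter-specific hint.
DOC = "<para>Roses are red,<?troff .br?>violets are blue.</para>"

def instructions_for(xml_text, formatter):
    """Return the data of each processing instruction addressed to `formatter`."""
    para = minidom.parseString(xml_text).documentElement
    return [node.data for node in para.childNodes
            if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
            and node.target == formatter]

troff_hints = instructions_for(DOC, "troff")   # the troff back end acts on these
tex_hints = instructions_for(DOC, "tex")       # a TeX back end sees nothing to do
print(troff_hints, tex_hints)
```

Nothing in a schema constrains what follows the target name, so each processor has to parse and trust its own instructions.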

On namespaces, I think it depends very much on one's circumstances as to one's perception. I do agree that the lack of support for namespaces in DTDs is a problem. The most likely fix, if XML is to change, is to remove DTDs altogether I think, as part of a general move away from macro processing and towards higher-level constructs you can manipulate and reason about with XML-based tools.


GJXDM, posted 9 Mar 2005 at 17:59 UTC by badvogato » (Master)

encountering a new acronym GJXDM - Global Justice XML Data Model . Wondering where and if possible to translate Michael Sandel's 'the limits of justice' into the Data Model design?

Fully integrate selected related standards, posted 14 Mar 2005 at 00:46 UTC by jrobbins » (Master)

First off, I don't think there needs to be a rush to 2.0.

But, by the time that the world is finally ready for 2.0, I think that there will be an even wider array of competing standards related to specific areas of XML. E.g., which is the schema language that everyone should focus on: DTDs, XML Schema, RelaxNG, or something else?

Pick a "winner", and make it easier for everyone to use that standard, while still leaving room for other points of view.

I am thinking of IBM's DITA as an example, because I am using it on an upcoming version of ReadySET Pro. I see DITA as basically (a) a useful library of transforms for technical documentation and on-line help, and (b) a big workaround for the lack of inheritance in DTDs and XSLT. If XML 2.0 picked a winner for schema notation that had enough support for inheritance, it could be integrated into the next XSLT standard, the next set of XML parsing APIs, and other standards and tools.

I guess a related idea is that a good next step for XML would simply be to get more people and tools to support the latest version of all the XML-related standards that are out there now. Anything that could be done to ease or encourage implementation, transition, and adoption would help. You might even go so far as to put some kind of sunset clause into new standards: would the Web be in better shape today if the HTML 3.x and 4 standards had had sunset clauses?


Configuration in XML: a blessing, posted 1 Apr 2005 at 05:31 UTC by MartySchrader » (Journeyer)

Using XML for configurations makes my life mucho, mucho easier. I do configurations for hardware as well as software, and being able to configure old hardware using the current config file or new hardware from the old config makes all the sense in the world to me. I also like the fact that I can make a simple change to a single parameter of the config without the fear of buggering up everything else. It's great to include somebody else's config, then alter one parameter, then rock and roll. Way cool.

Please don't screw up the back and forth generational compatibility of XML to make it more, um, "efficient." XML is not about compactness, speed, or efficiency. It's about making sure that every user can interpret the data the same way, whether that user be a machine, a developer, a user, or a manager. Well, maybe not the manager.
