RFC: Binary Markup Language

Posted 16 Aug 2000 at 18:32 UTC by Talin

This is a request for comments for Binary Markup Language (BML). BML is a platform-independent binary file meta-format which has been inspired by XML.

This is actually a really trivial idea, but I have found it to be surprisingly useful. I've talked to a number of people who really appreciate the power of XML, but are a little taken aback by the level of "bloat" due to the textual nature of the markup. For many applications this is fine, but for other applications the verbosity of XML is a significant issue.

I originally developed BML for storing collections of animated sprites for games. I wanted to store embedded PNG images within the file, which meant that using XML would be difficult. I also wanted to minimize file size, because I intended for these sprites to be downloaded while the game was playing. I later realized that the concepts could be generalized to a lot of other applications.

The current encoding scheme is very efficient, and the resulting files are tiny in comparison to XML. (Though I should admit that you could probably get better results by gzipping a normal XML file - but that would be more complex to decode.) It basically stores the data as a stream of opcodes, which when "executed" will cause a reconstruction of the original structure. Both SAX and DOM styles of parsing can be supported.
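
As a rough illustration of the "stream of opcodes" idea, here is a minimal Python sketch. It is not the BML encoding itself: the opcode values, the one-byte length fields, and the names OP_BEGIN, OP_DATA, OP_END and decode are all invented for this example. It only shows how "executing" such a stream can rebuild a nested structure with a stack.

```python
# Hypothetical opcodes, invented for illustration; the real spec defines its own.
OP_BEGIN = 0x01  # followed by 1-byte name length + ASCII name: open an element
OP_DATA = 0x02   # followed by 1-byte length + raw bytes: attach a binary payload
OP_END = 0x03    # close the current element


def decode(stream: bytes) -> dict:
    """Execute the opcode stream and return the reconstructed root node."""
    root = {"name": None, "data": [], "children": []}
    stack = [root]
    pos = 0
    while pos < len(stream):
        op = stream[pos]
        pos += 1
        if op == OP_BEGIN:
            n = stream[pos]
            name = stream[pos + 1:pos + 1 + n].decode("ascii")
            pos += 1 + n
            node = {"name": name, "data": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif op == OP_DATA:
            n = stream[pos]
            stack[-1]["data"].append(stream[pos + 1:pos + 1 + n])
            pos += 1 + n
        elif op == OP_END:
            if len(stack) == 1:
                raise ValueError("END without matching BEGIN")
            stack.pop()
        else:
            raise ValueError(f"unknown opcode {op:#x}")
    if len(stack) != 1:
        raise ValueError("unclosed element")
    return root


# Roughly <sprite><frame>...4 bytes of binary...</frame></sprite>
example = (bytes([OP_BEGIN, 6]) + b"sprite"
           + bytes([OP_BEGIN, 5]) + b"frame"
           + bytes([OP_DATA, 4]) + b"\x89PNG"
           + bytes([OP_END, OP_END]))
tree = decode(example)
```

This is the DOM-style variant. A SAX-style reader would run the same loop but call a handler at each opcode instead of building the tree.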

BML has been inspired by IFF, PNG, XML, Standard MIDI files, and the Python "pickle" format.

You can browse the specification for BML here. Any comments would be appreciated.

Thanks.


Reinventing the wheel?, posted 16 Aug 2000 at 18:42 UTC by egnor » (Journeyer)

Perhaps you'd like to consider WBXML, a binary form of XML (designed for WAP devices, which obviously have bandwidth constraints) that preserves the semantic structure of XML. Or maybe you'd be interested in XMill, an efficient compressor for XML that actually works better than gzip. Or maybe you'd like to perform a simple Web search to see dozens, hundreds, thousands of pages of discussion and proposed standards to solve the two problems you address (XML "bloat" and the difficulty of embedding binary data in XML files).

If you still think you have something new after all that, it would be a good idea to compare your BML to the existing proposals; otherwise, people like me will assume that you're simply ignorant of the prior art (and probably reinventing someone else's mistakes).

Reinventing the wheel? Perhaps., posted 16 Aug 2000 at 19:36 UTC by Talin » (Journeyer)

It's true I wasn't aware of the references you pointed out.

But I should be more clear - I'm not attempting to make a binary version of XML, which is what most of those other proposals seem to be wanting. What I'm really aiming for is more of a replacement for IFF. In other words, a format for storage and transmission of typed binary data, as opposed to a more compact representation for textual markup. For example, I don't have entities, but I do have enumerations. Thus, a lot of what is in those proposals (the ones I've looked at so far) doesn't really apply. That being said, there are some good ideas there.

So when I said "inspired by XML" I mean exactly that and nothing more - that is, I am borrowing some "ideas" from the semantics of XML, but this is not XML.

It is interesting that the WAP-related proposal uses the same integer encoding format that I chose, which is of course the same one used in Standard MIDI files. However, limiting a document to only 64 element types might be fine for wireless applications, but would not work for the kinds of applications I am envisioning. Also, I notice that their proposal depends a lot on "well-known names", which limits flexibility IMHO. In other words, it tends to give control over the creation of new formats to a large, centrally-organized authority - which is exactly what you'd expect from the WAP consortium.
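
For readers who haven't met it, the Standard MIDI file integer encoding is a variable-length quantity: seven bits per byte, most significant group first, with the high bit set on every byte except the last. A minimal Python sketch follows; the function names encode_vlq and decode_vlq are mine, not taken from any spec.

```python
def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a MIDI-style variable-length quantity."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [value & 0x7F]          # last byte: high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes: high bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes, pos: int = 0) -> tuple:
    """Decode one quantity starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:          # high bit clear marks the final byte
            return value, pos


small = encode_vlq(127)
larger = encode_vlq(128)
```

Values up to 127 fit in a single byte, which is why the scheme suits formats dominated by small counts and lengths.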

Reinventing the wheel (or not)..., posted 16 Aug 2000 at 20:13 UTC by egnor » (Journeyer)

It's true that WBXML has issues, as does XML itself. (The use of well-known integer identifiers isn't actually as bad as it seems at first blush, though. After all, a tag is useless if the application doesn't understand how to interpret it; the name might mean something to a human, but as far as software is concerned it might as well be a meaningless number. Effectively, schemas have to be registered somewhere in any case. So the identifiers just mean that processors have to declare all the tags they understand up front.)

Still, you're a bright, bushy-tailed free software developer with an itch; should you simply scratch it as best you know how, ignoring the morass of discussion and development and process that surrounds "standards" like XML? It's very easy to go mad trying to follow the W3C and its scions. Nobody is ever doing quite what you want, and there are lots of people who don't quite "get it" with the political clout to push their half-baked ideas. The only people I know that enjoy standards committee work are themselves insufferable!

Nevertheless, when you're talking about something like a universal file meta-format, it's very easy to end up with yet another minor de-facto "standard" that gets used in one or two applications, has a variation or two, but otherwise dies on the vine. And that's just a hassle for everyone; it litters the landscape with one more overblown framework for people to understand just because they want to read somebody's image file. You're trying to replace IFF -- be careful that your replacement doesn't suffer the same fate!

Sorry if my first post was harsh -- but I do encourage you to observe what people are doing with binary XML, and try to work in that world. If nothing else, if you do leave the fold, do so knowingly.

ASN.1 does the trick, no?, posted 16 Aug 2000 at 21:02 UTC by Toby » (Master)

I ain't no expert on these matters, but I thought that ASN.1 could be used to transmit binary data in a "platform" independent way...

BML a useful idea, but don't take it too far., posted 16 Aug 2000 at 21:16 UTC by splork » (Master)

When designing our protocols for mojonation we needed a message format. XML was considered but was rejected due to its complexity, its lack of an easy way to represent binary data, its lack of a canonical form, etc. We ended up rolling our own based on SEXPs (S-expressions; Ron Rivest has an RFC on their format) that pretty much amounts to a limited form of data pickling. It can encode dicts/hashtables, lists, strings, integers and None. Strings can contain arbitrary binary data.

Check it out. Look at the mencode.py file in the evil/common directory within the mojonation CVS repository on sourceforge. It is in Python, but should be easy enough to do in other languages as well if you're afraid of snakes.

If you need data typed beyond things like "string", "int", "list", etc., simply pass those in key/value dictionaries where the key string is your type. Your application will always have to define what keys it expects to be what, in an application-specific manner, no matter what markup "language" you use.

Mojo Nation's message formats, posted 16 Aug 2000 at 21:35 UTC by Zooko » (Master)

Mojo Nation's format also had the requirement that there be a 1-to-1 mapping between canonical form and Pythonic things. That is, if you have a dict (i.e. an associative array i.e. a hash table) mapping integers to lists of strings, and you convert it to canonical form in order to transmit it to another client, there is exactly one string that it can convert to and vice versa. This is necessary so that you can digitally sign that string.
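
To make the canonical-form requirement concrete, here is a small Python sketch. It is NOT the actual mencode wire format: the byte layout and the name canonical are invented for illustration. The point is only that sorting dict entries and allowing one spelling per value means equal objects always serialize to the same bytes, so a hash or signature over those bytes is stable.

```python
import hashlib


def canonical(obj) -> bytes:
    """Serialize None, ints, bytes, lists and dicts to one deterministic form.

    Invented layout (an assumption, not mencode): n | i<digits>e |
    s<len>:<bytes> | l...e | d<key><value>...e
    """
    if obj is None:
        return b"n"
    if isinstance(obj, bool):
        raise TypeError("bool is not supported in this sketch")
    if isinstance(obj, int):
        return b"i" + str(obj).encode("ascii") + b"e"
    if isinstance(obj, bytes):
        return b"s" + str(len(obj)).encode("ascii") + b":" + obj
    if isinstance(obj, list):
        return b"l" + b"".join(canonical(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Sort entries by their encoded key so insertion order never matters.
        items = sorted((canonical(k), canonical(v)) for k, v in obj.items())
        return b"d" + b"".join(k + v for k, v in items) + b"e"
    raise TypeError(f"unsupported type: {type(obj).__name__}")


a = {2: [b"b"], 1: [b"a"]}
b = {1: [b"a"], 2: [b"b"]}
digest_a = hashlib.sha256(canonical(a)).hexdigest()
digest_b = hashlib.sha256(canonical(b)).hexdigest()
```

The type prefixes also keep the integer 1 and the string "1" distinct, which is the job of the "type" annotations mentioned below.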

Mojo Nation's format was originally human-readable, but as it grew in scope (notably the addition of "type" annotations in the canonical form so that it knows the difference between the integer 1 and a string "1"), it became less and less easy for a human to glance at a message and understand it. Since we habitually just run "mencode.mdecode()" on the message and then look at the human-readable repr()'sentation of the actual Pythonic object, we stopped caring so much about readability of canonical form. But it is still readable if you squint.

Regards,

Zooko

hmmm, posted 16 Aug 2000 at 22:04 UTC by rillian » (Master)

I must admit that, like egnor, I initially misunderstood your intent, and he has a point about work involved in following the xml bandwagon. However, I disagree with a few of your assertions.

You're correct that running xml through gzip removes most of the bloat. This isn't appropriate for applications where every byte counts, but for something like game sprite descriptions, it should be good enough. But it's not fair to say this "adds too much complexity". zlib takes care of the compression/decompression for you, and it's a well-understood format. Having to write your own read/write module for a new format seems like more complexity to me. The real value of standards like xml is that even if they're more complicated than you need, there are established tools for dealing with them and most people will be familiar with them. Folks will have an easier time fiddling with your data.
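
To show how little code the zlib route takes, here is a short Python sketch using the standard zlib module. The sprite markup is made up for the example, and the actual size reduction depends entirely on the input.

```python
import zlib

# Made-up, repetitive sprite description standing in for a real XML file.
frames = "".join(
    f'<frame index="{i}" x="0" y="0" width="32" height="32"/>'
    for i in range(100)
)
xml = f"<sprites>{frames}</sprites>".encode("utf-8")

packed = zlib.compress(xml, 9)       # 9 = best compression
restored = zlib.decompress(packed)   # decoding is a single call

print(len(xml), "bytes raw,", len(packed), "bytes compressed")
```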

For serialization, I'd use something like .tar or .zip. Again, the code's been written and the files can be constructed with standard, widely available tools. So, for your game sprite example, what gets downloaded might be a .zip file containing a series of png (or mng) files defining the frames, and a (compressed) xml "header" file describing how they fit together; the typing for the binary data. This approach is used by a number of games already, Quake3 and CrystalSpace for example, though the 'typing' is usually implicit. The jar format works like this, too.
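
A sketch of that layout in Python, using the standard zipfile module. The file names, the manifest contents and the function name build_bundle are assumptions for illustration, and the "PNG" bytes are placeholders rather than real images.

```python
import io
import zipfile


def build_bundle(manifest_xml: bytes, frames: dict) -> bytes:
    """Pack an XML manifest plus binary frame files into one in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The textual manifest compresses well, so deflate it.
        zf.writestr("sprite.xml", manifest_xml,
                    compress_type=zipfile.ZIP_DEFLATED)
        # PNG data is already compressed, so store it as-is.
        for name, data in frames.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


manifest = b'<sprite><frame src="frame0.png"/><frame src="frame1.png"/></sprite>'
frames = {"frame0.png": b"\x89PNG placeholder 0",
          "frame1.png": b"\x89PNG placeholder 1"}
bundle = build_bundle(manifest, frames)

# Reading it back needs nothing beyond the same module.
with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
    names = zf.namelist()
    first_frame = zf.read("frame0.png")
```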

Take a look at ASDL, posted 17 Aug 2000 at 17:35 UTC by danwang » (Master)

http://www.cs.princeton.edu/zephyr/ASDL

Try out the interactive demo...

Read the paper from DSL97

http://www.cs.princeton.edu/zephyr/ASDL/docs/dsl97/dsl97-abstract.html

ASDL's binary format uses the same encoding as BML ....

Learn ML or rewrite asdlGen in C++ :)

Thanks for all of the comments, posted 22 Aug 2000 at 04:05 UTC by Talin » (Journeyer)

I've gone ahead and looked at many of the references you folks suggested. Despite the fact that the majority thinks that the proposal is a bad idea, or at least redundant, I'm still going to continue developing it. However, it appears that for the sake of accuracy, a name change might be in order. One suggestion was XBF - "extensible binary format".

With regards to the "fate" of IFF: At one time, there were literally hundreds of applications which supported IFF, so you can consider this a "success" in the same vein as considering Netscape Navigator a "success" despite the fact that it has a very small market share today. IFF went out of favor when the Amiga did; in the Windows world, it was replaced mostly by opaque, proprietary formats, whereas on the Internet, it was supplanted by ASCII-based protocols such as HTTP, FTP, SMTP, etc.

However, despite the popularity of IFF, it did have a couple of warts:

  • dependent on fixed-format binary structures
  • chunk identifiers limited to 4 characters
  • all data structures had to be byte-swapped on little-endian machines.
  • precalculating nested chunk lengths was a pain.
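
For anyone who never wrote an IFF file, here is a minimal Python sketch of the chunk layout those warts refer to: a 4-character ID, a 32-bit big-endian length, the payload, and a pad byte when the payload length is odd. The helper names iff_chunk and iff_form are mine, and the sketch ignores everything beyond plain FORM nesting.

```python
import struct


def iff_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk: 4-byte ID, big-endian 32-bit length, payload, pad."""
    if len(chunk_id) != 4:
        raise ValueError("chunk identifiers are exactly 4 characters")
    header = chunk_id + struct.pack(">I", len(payload))
    pad = b"\x00" if len(payload) % 2 else b""  # pad is not counted in length
    return header + payload + pad


def iff_form(form_type: bytes, chunks: list) -> bytes:
    """Wrap finished chunks in a FORM; its length must cover all of them."""
    if len(form_type) != 4:
        raise ValueError("form types are exactly 4 characters")
    return iff_chunk(b"FORM", form_type + b"".join(chunks))


name = iff_chunk(b"NAME", b"abc")    # odd payload, so one pad byte is added
form = iff_form(b"ILBM", [name])
```

Because the FORM length comes before its contents, every inner chunk has to be fully built (or its size computed) before the outer header can be written, which is the precalculation problem from the list above.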

The fact that Standard MIDI Files chose to use what is essentially a "stripped" IFF file - that is, an IFF-style chunk format, but with the containing "FORM" layer ripped out - is particularly revealing, especially considering the need for SMF to be produced by small, ROM-based musical hardware devices.

Anyway, if you would like to comment on the actual proposal itself (as opposed to whether the proposal is a good idea or not) I would still be interested in your feedback. Thanks...
