Interview: Caolan McNamara

Posted 4 Sep 2000 at 19:41 UTC by advogato

Advogato is pleased to interview Caolan McNamara, author of the useful wv toolkit and free expert on the Microsoft Word file formats. Since February of this year, he's been working for StarDivision on what's now the OpenOffice project to bring a topnotch office suite to the GPL world.

Advogato: How did you get started in free software?

Caolan McNamara: In college in late '92 or maybe early the next year, a guy called John Quinn and friends had succeeded in getting Linux onto his top-of-the-line 486 and, most importantly, installing circlemud onto it, so... Eventually chopping the crap out of beasties began to pall, but being able to have your own Unix rather than having to fight with the authorities that be to get some time on a wonderful ApolloOS or Ultrix system was pretty neat. I like writing code a lot, and the vast free code base accreted around Linux was incredibly useful, both as tools for a student living in poverty and for learning from. Sitting on my ass with my maw open, taking continuously, didn't appeal, so I made a few stabs at writing useful things to return the favour and perpetuate the system.

We don't hear too much from the Irish free software scene. How would you characterize the community there?

Well, there's certainly a mountain of commercial software being written or being passed through the place, world's biggest exporter of software and all that (see this OECD IT Outlook). There is an active community of free software users, and there have been serious Linux and Unix heads around for years. But there hasn't been the crossover to create a large amount of free software, which is bothersome. Alan Cox, on a brief visit, ventured to suggest that this was because we spent all of our time in the pub, a blatant unfounded racial slur of course. Nevertheless it's a mysterious issue.

How did you get involved in grokking Microsoft file formats?

The 97 specs showed up on their website in approximately July 1998. I took a look at them and thought about implementing a text extraction tool that would also take the fastsaved nonsense into account. No one else seemed interested in doing it. Once that was wrapped up, it didn't seem too far-fetched to expand it to "simple html markup" and "simple graphics". The AbiWord people put together an awesome but scary kludge to import Word files using a wrapper around the old incoherent mswordview version, so we rewrote it as the wv library. wv spun off a few other bits and pieces along the way: the wvDecrypt module to decrypt Word and technically other Office files, libwmf to convert wmf files, ivt2html, a quick hack to convert those MSDN CD ivt files to html, wvSummary to dump summary information from ole2 documents, and so on. wv expanded to take the 95/6 formats into account, and the contributed ole code went into cole, which turned into libole2, which gnumeric and friends sit on top of nowadays. And then I got a mail from StarOffice and they kidnapped me in January. Now I get paid to work on a vast to-be-GPLed code base, pretty neat eh.

What do you think you've learned from these about Microsoft as a company and the way they create software?

There are so many file formats in Office that it's like an ecosystem, incestuous couplings of subformats merrily prancing away under the hood. There's little sign of careful future-proofing having gone into their formats. On the other hand, the Escher graphic file format is quite tidy and the basic OLE2 streams concept is fine, giving programmers a file system but giving users a single file to move around the place. So basically two thoughts:

1. Some reasonably ok ideas, but very bad follow-through into correctly working clean code (not that I might be the best person to bring that accusation).

2. The same problems all large companies and old projects suffer: incremental cruft as people forget what chunks of code are for and lose the overall picture of how things work, and start nailing functionality onto the side.

What's good and what's bad about the MSWord file format?

The good thing is that it is pretty much unchanged from the beginning: having a Word 95 reader allows you to make a fair stab at having a 97 reader, with zero modifications to at least read 2000 documents. Vice versa, allowing some care to handle the non-OLE-streamed nature of older formats, you can handle them as well without an insurmountable amount of work. And MS sticks to its tried and trusted set of techniques, for instance always the same two or three compression schemes. The compression in ivt files is the same as that in cab, etc.

The bad stuff is that the format is buggy in places. The 95 lists were changed to a completely different 97 list format, but "95 lists may still occur in 97 documents". Sounds to me like someone couldn't figure out how to remove the old code without breaking the whole thing. The 97 upgrade from 95 for the file format was simply to change practically all 8-bit strings to Unicode; nevertheless they themselves couldn't export to 95 except through RTF. An Indian company mentioned to me that, in contact with East Asia Microsoft, they were told that there wasn't the expertise internally in MS to handle Word format technical queries, and were fobbed off to wv. The fastsaved technology was hijacked to kludge Unicode support onto the old format; reading the old Word 2 format documentation, the 6 format and the 97 format, all show the exact same document with incremental additions. All in all, lots of evidence that it's gotten completely out of hand and that Redmond has been lumbered with a fragile file format that they no more fully understand than we do. There isn't a conspiracy (or at least it's a retrofitted one) in which MS is actively fighting a file format battle with the world. It just grew that way.
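As an illustration of the remark about older formats not being OLE-streamed, here is a minimal sketch of how a reader might tell the two cases apart before choosing a parsing path. It is not taken from wv. The function name is hypothetical, and the only fact relied on is the well-known 8-byte signature at the start of OLE2 compound files.

```python
# Sketch only: not wv's actual detection logic.
# OLE2 compound files begin with this fixed 8-byte signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE2 compound file signature."""
    return data[:len(OLE2_MAGIC)] == OLE2_MAGIC
```

A reader could call this on the first bytes of a file and fall back to a flat-file parser for older documents when it returns False.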

If you were designing a word processor format from scratch, what would it be like? XML-based?

File format wise, something in plain human-readable text like XML is the way to go. Some independent ability to validate that the input/output is sane, a built-in ability to ignore non-understandable tags and attributes from future versions, etc. All good stuff, with lots of knowledge floating around as to how XML works. On the other hand it's a pain to put a graphic file or, for instance, an OLE2 stream for an embedded legacy app, say equation editor[1], directly into a text-based XML file, though there are a couple of possibilities, all of which would work fine with varying degrees of ugliness. But I'm not an XML head, I'll leave that to the experts.
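The interview does not say which possibilities are meant for carrying binary data in XML. One common option, shown here purely as an assumption, is to base64-encode the bytes into element text. The `<blob>` element and its attribute names are invented for this sketch and belong to no real schema.

```python
import base64
import xml.etree.ElementTree as ET


def embed_blob(parent, name, payload):
    """Store binary payload as base64 text in a hypothetical <blob> child element."""
    blob = ET.SubElement(parent, "blob", {"name": name, "encoding": "base64"})
    blob.text = base64.b64encode(payload).decode("ascii")
    return blob


def extract_blob(blob):
    """Recover the original bytes from a <blob> element written by embed_blob."""
    return base64.b64decode(blob.text)


# Round trip: binary in, text-only XML out, identical binary back.
original = bytes([0xD0, 0xCF, 0x11, 0xE0, 0x00, 0xFF])
doc = ET.Element("document")
embed_blob(doc, "equation1", original)
xml_text = ET.tostring(doc, encoding="unicode")
roundtrip = extract_blob(ET.fromstring(xml_text).find("blob"))
```

The ugliness is visible in the trade-off: the file stays plain text, but the payload grows by about a third and is no longer human-readable.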

It's pretty well understood that incompatibilities in the file format force people to upgrade their Microsoft office suites (I've heard that some files saved in MS Word 6.01 can't be loaded in 6.0 - feel free to elaborate).

I didn't know that 6.0.1 vs 6.0 was actually a problem for Word as well, but I have a memory of a wv showstopper difference between 6.0.1 and 6.0. It was something to do with the font names (FFN) or some similar structure: some extra data being appended onto some of these structures (for Asian support I theorized, probably incorrectly), so that the advertised size of the structure didn't match the reality. I also have a changelog entry for a workaround in the summary stream information for 6.0.1 as well, but the exact details escape me.

How is that going to differ in Linux, and how do you think that will affect adoption of free software office suites?

On the MsOffice incompatibilities, any new MS version will either be incremental add-ons to the existing binary format, which we have on the ropes and which will not be a major problem, or it will be some new thing, perhaps some sort of humorous standard-mangling XML, which would be much easier to import in comparison anyhow. So I don't see this as a problem. It will still mean that the average user may have to upgrade OpenOffice each time Microsoft release a new version of their suite, but it's not as if our customers have to shell out an upgrade tax for the privilege.

Incompatibilities between OpenOffice file versions shouldn't become a problem. Just ignoring unrecognized tags and attributes should avoid ever having to write a special save-as-older-version filter: instant time saving, isn't that great. No Word 97 to 95 style fiasco. We should also be able to avoid problems like this with an open development system, with a wider group of testers, platforms and bizarre setups. So stuff where only the English version crashes when you do "complicated thing" while the German one is fine won't have as easy a time slipping through the net.
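The idea of an older reader ignoring tags and attributes it does not recognize can be sketched roughly as follows. The element and attribute names (`doc`, `para`, `align`, `sparkle`, `glow`, `hologram`) are made up for the example and are not the OpenOffice format. The reader asks only for what it understands and never looks at the rest.

```python
import xml.etree.ElementTree as ET


def extract_paragraphs(xml_text):
    """Read (align, text) pairs the way an 'old' reader might.

    Only <para> elements and their 'align' attribute are understood.
    Unknown attributes are never queried, unknown sibling elements are
    never visited, and text inside unknown child tags is kept as plain text.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for para in root.iter("para"):
        align = para.get("align", "left")  # known attribute, with a default
        text = "".join(para.itertext())    # flattens through unknown child tags
        paragraphs.append((align, text))
    return paragraphs


# A document written by a hypothetical newer version of the format.
newer = (
    '<doc version="2">'
    '<para align="center" glow="yes">Hello <sparkle>new</sparkle> world</para>'
    '<hologram/>'
    '</doc>'
)
result = extract_paragraphs(newer)
```

Because the old reader degrades gracefully on the newer file, no separate export filter for the older version is needed in this sketch.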

There's a lot of talk about StarOffice being Gnomified. Any word on integration with KDE?

StarOffice is not a vast corporation with gazillions of employees; it's owned by one, but that's not the same thing at all. So it cannot afford to spread itself thin. My belief is that no barriers will be actively placed in the way of interoperability with KDE, but a choice has to be made, and the main focus will be on interoperability with GNOME's Bonobo linking and embedding because it's closer to our own, the main top-secret internal technical reason being that the foot looks a lot cuter than the K. But seriously, there was always going to be someone slightly disappointed here. Anyhow, if KDE sticks together a mechanism for using Bonobo components in their apps then I imagine they can play too.

You recently passed maintenance of wv to Dom Lachowicz. What are your thoughts on changing maintainers of free software projects, and wv in particular?

It's kind of tough to do, actually. There isn't a chance in hell that I'd ever have time to continue work on wv right now, and of course it makes absolutely no sense for me to work on my competitor, so a new maintainer was needed, but I had dodged the issue since last Christmas. Handovers are stressful: you get very, very attached to the software, child surrogate; watching it linger maintainerless is annoying, but you dread the possibility of future coders trampling all over the clever bits and making a complete mess of the design. But I think Dom will be excellent for it.

I am taking some glee from forwarding all wv mail to Dom, and reclaiming the space from the automated conversion site (1 gig of document and wmf files in bzip2 tar files from 1 Jan to 1 Aug). So I am kind of glad that it's handed over and I can move on. I have no idea how people like Linus or Alan handle the volume of mail; the incessant questions wore me down.

What kind of clothes do you wear?

Zero clothes sense, I stick with black, lots of black, sometimes in a spirit of lightheartedness I wear some black instead.

[1] (which btw is crippleware MathType: Insert object -> Equation Editor -> bottom left button of equation panel -> choose one of the horizontal braces, save, reopen, activate equation, barf: "you gotta buy the full version to use that boy")

<disclaimer> These are my personal opinions, you'd want to be utterly crazed to consider these official positions of StarOffice/Sun or even vaguely congruent with those parties. </disclaimer>

Additional question, posted 6 Sep 2000 at 04:06 UTC by tja » (Journeyer)

One question I wish had been asked: for the Gaelic-impaired such as myself, how does one pronounce "Caolan"? :-)

Re: Additional question, posted 6 Sep 2000 at 07:47 UTC by sneakums » (Journeyer)

I tend to pronounce it kway-lawn.
