Structured Storage: Quo Vadis?

Posted 20 Feb 2004 at 14:17 UTC by ariya

The design of Microsoft structured storage, which is used in, among others, the Microsoft Office document format (OLE Compound Document), shows its age. It might no longer be suitable for storing documents.

First of all, it's rather complicated. Obviously it was modelled after the FAT filesystem (which is also quite ancient). Writing routines to decipher it isn't trivial - see the source code of LAOLA, libcole, libole, libgsf, poifs, or Wine's implementation - and is even prone to hard-to-spot weirdnesses, not to mention poor code readability. And writing routines to construct it can no longer be considered a homework exercise.

A few parts are hard to believe. Splitting the small and big streams into their own spaces, with different allocation tables, was presumably designed to overcome the wasted-space problem, as in the case of the FAT filesystem. But that is hardly an issue for 99% of real-world office documents. Imitating FAT also means facing possible fragmentation problems. More critical still is the allocation table. If part of your Word file is damaged, you'll be in trouble if the damage is in the so-called block allocation table (you're lucky if it's somewhere else). Also, constructing the directory structure as a red-black tree seems over-engineered. That effort could have gone into something else, say error checking, or compression, or even strong encryption.
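To make the allocation table worry concrete, here is a minimal sketch of how a reader follows a stream through a FAT-style table, where each entry names the next sector of the stream. The function name and the toy table are made up for illustration; the two sentinel values are the ones I believe the compound file format uses, so treat them as an assumption.

```python
# Sentinel values (assumed to match the compound file format).
END_OF_CHAIN = 0xFFFFFFFE
FREE_SECTOR = 0xFFFFFFFF


def follow_chain(table, start):
    """Walk a FAT-style allocation table and return the sector list of a stream.

    table[i] holds the number of the sector that follows sector i.
    Raises ValueError when the chain leaves the table or loops back on itself.
    """
    chain = []
    seen = set()
    sector = start
    while sector != END_OF_CHAIN:
        if sector >= len(table) or sector in seen:
            raise ValueError(f"corrupt allocation table at sector {sector}")
        seen.add(sector)
        chain.append(sector)
        sector = table[sector]
    return chain


# Toy table holding two streams: 0 -> 1 -> 2 and 3 -> 4. Sector 5 is unused.
table = [1, 2, END_OF_CHAIN, 4, END_OF_CHAIN, FREE_SECTOR]
print(follow_chain(table, 0))  # [0, 1, 2]
```

Overwrite a single entry (say `table[1] = 1`) and everything after sector 1 becomes unreachable, even though the data sectors themselves are untouched. Damage of the same size inside a data sector would only garble that one sector.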

Ever since, Microsoft has introduced new hacks here and there. For example, to support big documents (when it was first designed, I'm sure nobody thought of huge spreadsheets), an additional meta allocation table workaround was employed, adding another layer of complexity. One allocation table is delicate already; now there are three to deal with.
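A rough back-of-the-envelope sketch of when the meta allocation table kicks in. It assumes the commonly used layout: 512-byte sectors, 4-byte table entries, and 109 slots in the file header that point at allocation table sectors. The function names are mine, and the estimate ignores the sectors taken up by the tables and the directory themselves.

```python
SECTOR_SIZE = 512    # bytes per sector (assumed; the common value)
ENTRY_SIZE = 4       # bytes per allocation table entry
HEADER_SLOTS = 109   # allocation table sector pointers stored in the header


def entries_per_sector(sector_size=SECTOR_SIZE):
    """Number of table entries that fit in one sector."""
    return sector_size // ENTRY_SIZE


def max_bytes_without_meta_table(sector_size=SECTOR_SIZE):
    """Data size addressable through the header slots alone."""
    return HEADER_SLOTS * entries_per_sector(sector_size) * sector_size


def meta_sectors_needed(data_bytes, sector_size=SECTOR_SIZE):
    """Rough count of extra meta table sectors for a given amount of data."""
    per_sector = entries_per_sector(sector_size)
    data_sectors = -(-data_bytes // sector_size)    # ceiling division
    table_sectors = -(-data_sectors // per_sector)
    overflow = max(0, table_sectors - HEADER_SLOTS)
    # One entry of every meta sector is spent on the link to the next one.
    return -(-overflow // (per_sector - 1))


print(max_bytes_without_meta_table())           # 7143424
print(meta_sectors_needed(100 * 1024 * 1024))   # 12
```

So under these assumptions anything beyond roughly 6.8 MB needs the third table, which is not a lot for a spreadsheet or a presentation full of pictures.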

From my limited observation of the latest Microsoft Office, it seems that even Microsoft has tried ways to work around this inconvenience. Excel always keeps its streams at least 4 KB in size, hence no small stream will ever be produced. In a way that's also stupid, because if you have one worksheet with only one cell, the Excel document is already 14 KB in size. Considering that the structured storage code still lives in one of the Windows DLLs and won't go away anyway, there must be a good reason why they did that. One theory (purely speculative, though) would be that Excel programmers don't trust the reliability of other programmers' code.
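The effect of that padding can be sketched as follows, assuming the usual constants: a 4096-byte cutoff below which a stream goes into the small stream area, 64-byte small sectors and 512-byte regular sectors. The helper names are hypothetical, and this only models where a stream lands, not how Excel actually pads it.

```python
MINI_CUTOFF = 4096   # streams smaller than this go to the small stream area
SECTOR = 512         # regular sector size in bytes (assumed)
MINI_SECTOR = 64     # small sector size in bytes (assumed)


def placement(size):
    """Return (area, allocated_bytes) for a stream of the given size."""
    if size < MINI_CUTOFF:
        area, unit = "small", MINI_SECTOR
    else:
        area, unit = "regular", SECTOR
    allocated = -(-size // unit) * unit   # round up to whole sectors
    return area, allocated


def pad_to_cutoff(size):
    """Grow a stream to the cutoff so it never lands in the small area."""
    return max(size, MINI_CUTOFF)


print(placement(100))                 # ('small', 128)
print(placement(pad_to_cutoff(100)))  # ('regular', 4096)
```

A 100-byte stream costs 128 bytes in the small area but 4096 bytes once padded; that is the price of never touching the small allocation table.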

Even funnier is the fact that although structured storage is able to hold another storage inside (that's what happens when you embed an Excel sheet in a Word document), PowerPoint would rather hold the pictures/clip art of your slides in one big chunk and manage them by itself. Think of it: your presentation is on the disk (using the filesystem of your OS), the content is stored as structured storage (another form of filesystem) and the pictures are inside yet another container (alias mini filesystem). What a joy.

There was once a security hole in Microsoft's own implementation of structured storage. It was found in Microsoft Office 98 for Mac, also showed up in Wired News, and was acknowledged by Microsoft: see Q139432.

um, posted 20 Feb 2004 at 18:31 UTC by elanthis » (Journeyer)

So what exactly was the point of this article?

re: um, posted 20 Feb 2004 at 18:54 UTC by ariya » (Master)

On the first paragraph

XML, posted 21 Feb 2004 at 19:39 UTC by tjansen » (Journeyer)

Well, I guess that's why they want to switch to XML.

PPT format precedes becoming ole2 structured storage, posted 22 Feb 2004 at 18:38 UTC by caolan » (Master)

PowerPoint will rather hold those pictures/clip arts of your slides in one big chunk and manage them by itself.

This is almost certainly simply because PowerPoint wasn't originally a Microsoft product and its format was set before it made use of structured storage (so it's been retrofitted back into OLE2). Nevertheless, I was also astonished at the amount of layering that goes on when you see how an ole2 object embedded in a ppt document is stored as a flattened ole substorage which is then split up into chunks wedged into the ppt record structure.

Similarly, with word6/7 you can see that it's simply the older pre-ole2 format with all the original data just stuffed into a single "WordDocument" stream, which then just takes advantage of its new home to allow embedded ole2 substorages and the SummaryInformationStream. It's only with word8+ that they split the main stream to get the extra {0|1}Table and Data streams. You see all sorts of odd stuff in these msoffice applications that have been around for ages and are constantly being nudged along to do new things that were never part of their original design.

Why a single file not FS ?, posted 24 Feb 2004 at 11:16 UTC by Malx » (Journeyer)

It's just hard to DnD or attach to an e-mail a document that is split into a directory structure on the current file system. You can't be sure that the file attributes are still OK (file modification, creation date, etc). The file could be sent, downloaded ... and it must survive. :)
