The Chicago Project: Making access to Excel data easy

Posted 27 Feb 2002 at 17:10 UTC by jackshck

As it currently stands, if you need to access Excel files under UNIX
you have to write your own code to do so. There is no standard
open-source library to access Excel data. This article will discuss
the Chicago Project, and how it attempts to fix this problem.

  The Chicago Project will create an open source library to read and write Excel files. You
may wonder why this is needed. After all, Gnumeric and StarCalc deal with these files, right?

Well, yes they do, but each of them is at a different level of support.

  If you want to read or write Excel data from your application, how would you do it? You could look at the Gnumeric and KSpread code to figure out how they do it, or you could write your own code. It would be a lot simpler if there were a standard API for accessing Excel files under UNIX. The Chicago Project will attempt to provide that API. To do so, however, we need your help. The goals of the Chicago Project are:

1. An open source C library to read and write OLE files. A good amount of code exists to deal with these files. It needs to be combined into one project. I propose the name libOLE for this project.

2. An open source C library to read and write Excel files. A good amount of code exists to deal with these files. It needs to be combined into one project. I propose the name libXLS for this project.

3. Complete documentation of the OLE file format. Documentation on this file format is scattered all over the net. It needs to be put in one place. As I find documentation on the file format, I am linking to it at my web site. If you know of any documentation on the file format, please let me know.

4. Complete documentation of the Excel file format. The most complete source of documentation on this format is located at http://sc.openoffice.org in PDF and XML format. However, several sections are marked 2do. They need to be filled in.

5. Documentation of the Excel Encryption Algorithm. I have spent a good deal of time searching for information on this topic. All the information that I have found is available at: http://chicago.sourceforge.net/devel/docs/excel/encrypt.html Currently, no open source spreadsheet can open password-protected Excel workbooks. This is NOT ACCEPTABLE! The wvWare project (http://wvware.sf.net) can open password-protected Word documents if the correct password is supplied. I have begun to modify this code to deal with Excel files.

However, as no standard library is available to deal with Excel files, I am running into a brick wall while modifying the code.
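
As a rough illustration of goal 1, the first thing any OLE reader has to do is recognise the compound-document header. The sketch below is in Python rather than C, purely for brevity. The function `parse_ole_header` and its field names are hypothetical and are not part of libOLE or any existing library. It assumes the commonly documented header layout: an 8-byte signature, the minor and major versions at offsets 24 and 26, a byte-order mark at offset 28, and the sector shift at offset 30.

```python
import struct

# Signature found at the start of every OLE 2 compound document.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_ole_header(data):
    """Return basic header fields, or None if data is not an OLE 2 file."""
    if len(data) < 32 or data[:8] != OLE_MAGIC:
        return None
    # Offsets 24..31: minor version, major version, byte order, sector shift.
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    if byte_order != 0xFFFE:  # marker for little-endian storage
        return None
    return {
        "minor_version": minor,
        "major_version": major,
        "sector_size": 1 << sector_shift,
    }


# Build a fake 512-byte header in memory instead of reading a real file.
header = OLE_MAGIC + bytes(16) + struct.pack("<4H", 0x3E, 3, 0xFFFE, 9)
header = header.ljust(512, b"\x00")
print(parse_ole_header(header))
```

A real reader would go on to walk the sector allocation table and the directory stream; this only covers the identification step.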

If you would like to help with the Chicago Project, please e-mail me at jackshck@yahoo.com


I think this is a mistake, posted 27 Feb 2002 at 17:38 UTC by Jody » (Master)

OLE : the gnome project's libole is certainly not the most beautiful code and could use some cleaning. However, a complete rewrite seems a bit over the top, but that is a matter of taste.

MS Excel : A library to fully manipulate XL files is going to require support for storing all of the different objects in a spreadsheet. Doing that basically implements a spreadsheet at some level. IMO the best bet will be to help the work on gnumeric to split out 'libgnumeric'. Anything less than a full spreadsheet + extensions to handle all the extras in MS Excel is going to be lacking. Basic capabilities such as those in the read/write excel perl modules should be sufficient for most tasks; to go further than that is going to require infrastructure.

I applaud your efforts to get docs on the encryption, that would certainly be a boon. As would any additional docs you can dig up on undocumented or murky corners. However, your desire to rewrite large blocks of complex code from scratch seems misplaced. There is not enough gain to warrant it.

Re: I think this is a mistake, posted 27 Feb 2002 at 18:52 UTC by jackshck » (Journeyer)

Jody wrote

OLE : the gnome project's libole is certainly not the most beautiful code and could use some cleaning. However, a complete rewrite seems a bit over the top, but that is a matter of taste.

IMO libole2 is very complex. To see how simple it is to handle OLE files, please download libOle from the Chicago project.

MS Excel : A library to fully manipulate XL files is going to require support for storing all of the different objects in a spreadsheet. Doing that basically implements a spreadsheet at some level.

If you are writing a spreadsheet anyway, then that point isn't valid.

I applaud your efforts to get docs on the encryption, that would certainly be a boon. As would any additional docs you can dig up on undocumented or murky corners.

Thank you. What should I look for docs on?

However, your desire to rewrite large blocks of complex code from scratch seems misplaced. There is not enough gain to warrant it.

I think there is the potential for a lot of gain. The reason is this: every person who writes a spreadsheet has to write the code to access Excel files from scratch. If there were a library that provided access to Excel files, it would save a lot of time and effort.

GPL is not the most suitable license for this project, posted 27 Feb 2002 at 21:13 UTC by shd » (Journeyer)

Looking at the website:

> The API will be written in ANSI C, and will be licensed under the GNU General Public License.

I believe the GPL will not serve the community's best interest in this case. Since this is an interoperability project, it should also be possible to use this piece in non-GPL projects. The LGPL (among other licenses) would allow that. The LGPL also permits forking back to the GPL.

Re: GPL is not the most suitable license for this project, posted 27 Feb 2002 at 21:33 UTC by jackshck » (Journeyer)

Hello

I have considered placing the Chicago project code under the LGPL. I am not dead set on the GPL. The final decision will be based on input from the user community.

xlhtml, posted 28 Feb 2002 at 00:16 UTC by grant » (Journeyer)

I was about to recommend xlhtml, at least for reading Excel files, since I've used it successfully for that purpose.

The xlhtml.org site was nowhere to be found, but it appears that the Chicago Project builds from xlhtml and uses its libraries.

Re: xlhtml, posted 28 Feb 2002 at 00:53 UTC by jackshck » (Journeyer)

grant wrote:

I was about to recommend xlhtml, at least for reading Excel files, since I've used it successfully for that purpose.

It is a rather good reader, so it would be a good recommendation

The xlhtml.org site was nowhere to be found

I noticed that. It was up yesterday, but is not today.

but it appears that the Chicago Project builds from xlhtml and uses its libraries.

You're right and wrong on this point. I have contributed code to the xlhtml project, but am not using any of their code at the moment. One goal of the xlhtml project is to turn xlhtml into a library and an application that uses that library. I am going to help them with this goal. When they have completed the library, I will integrate it into the Chicago project.

Spreadsheet::WriteExcel, posted 1 Mar 2002 at 21:55 UTC by ReadMe » (Journeyer)

You might start by looking at the perl module:
Spreadsheet::WriteExcel
http://homepage.eircom.net/~jmcnamara/perl/WriteExcel.html

why?, posted 1 Mar 2002 at 23:38 UTC by ishamael » (Journeyer)

I'm confused. why exactly do you feel we need a standard open source library to access Excel data? perhaps forgive my narrowmindedness, but why would you need to read Excel data unless you were writing some sort of spreadsheet application. then, if you were writing such an application i would ask, why? gnumeric, kspread, etc, there are already plenty, and they're already pretty damn slick and complete. so.... why?

Re:Spreadsheet::WriteExcel, posted 2 Mar 2002 at 17:20 UTC by jackshck » (Journeyer)

Hello

Thank you for mentioning this. I have already talked with the author about the Excel file format.

Re Why, posted 2 Mar 2002 at 17:23 UTC by jackshck » (Journeyer)

I'm confused. why exactly do you feel we need a standard open source library to access Excel data? perhaps forgive my narrowmindedness, but why would you need to read Excel data unless you were writing some sort of spreadsheet application.

Hello

  You may want to write Excel files from a database application, or something like that. If your application can't write a format that Excel reads, then you have to export a CSV file. If a standard library exists to read and write Excel data, you don't have to write a CSV file.
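
For comparison, the CSV fallback described above might look roughly like the following. This is a minimal sketch in Python using only the standard `csv` module. The helper `rows_to_csv` and the sample rows are invented for illustration and do not come from any of the projects discussed here.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise an iterable of row tuples to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: commas, minimal quoting
    writer.writerows(rows)
    return buf.getvalue()


# Made-up rows standing in for a database query result.
rows = [("id", "name", "total"), (1, "Smith, J", 10.5)]
print(rows_to_csv(rows))
```

Cell types, formatting and formulas are all lost on the way through CSV, which is the gap a native Excel writer would close.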

There IS a way to read/write Excel files under Unix and You Know It!, posted 3 Mar 2002 at 22:37 UTC by acoliver » (Journeyer)

Hi, We've corresponded in the past and I've emailed you before to see if you were interested in collaborating. There *IS* a standard project that is far along and can read/write Excel files from UNIX, and it's even linked (to the old site) on your page: Jakarta POI. We've also recently received some new donations that allow us to read Word (in the process of integrating) and Document Summary information. Next, we have the cleanest, most complete port of the OLE 2 Compound Document format that I know of. The project is of course in Java as opposed to ANSI C, but that pretty much means you can run it darn near anywhere (we actively test it on Linux and Windoze). And obviously it's APL and not GPL. So I just wanted to register that I take objection to the statement "if you need to access Excel files under UNIX you have to write your own code to do so". This response has been typed from a Linux box where all POI code is tested darn near daily. To clarify for the deceived: there IS an open source API for reading, creating and writing Excel files that runs on just about ANY platform. It's implemented using pure Java. Furthermore, there is even an XLS serializer for Cocoon for those who prefer to write XML rather than Java. (And it is compatible with the Gnumeric tag library). - Thanks.

OLE trademark?, posted 3 Mar 2002 at 22:42 UTC by acoliver » (Journeyer)

BTW, I'm curious. LibOLE2 uses "OLE" in its name. I have a feeling OLE is a trademark.. Anyone have some legal knowledge on this?

To help everyone writing a spreadsheet, posted 4 Mar 2002 at 12:24 UTC by Uraeus » (Master)

Jody wrote: MS Excel : A library to fully manipulate XL files is going to require support for storing all of the different objects in a spreadsheet. Doing that basically implements a spreadsheet at some level.

jackshck wrote: If you are writing a spreadsheet anyway, then that point isn't valid.

I think there is the potential for a lot of gain. The reason is this: every person who writes a spreadsheet has to write the code to access Excel files from scratch. If there were a library that provided access to Excel files, it would save a lot of time and effort.

This is really funny, for some reason I have a hard time seeing the target audience of people writing their own spreadsheet as being that big. People wanting to hack on a spreadsheet should do so by joining Gnumeric or StarCalc IMHO.

Re: There IS a way to read/write Excel files under Unix and You Know It!, posted 4 Mar 2002 at 17:39 UTC by jackshck » (Journeyer)

Hi, We've corresponded in the past and I've emailed you before to see if you were interested in collaborating.

Hello


  I am interested in collaborating. I guess it is a bit silly to start another project, when yours is rather mature. The only problem I have is with the APL license. But then I can be convinced to overlook that :-}.

for Collaborating on POI, APL Rulez ;-), posted 5 Mar 2002 at 00:24 UTC by acoliver » (Journeyer)

I am interested in collaborating. I guess it is a bit silly to start another project, when yours is rather mature. The only problem I have is with the APL license. But then I can be convinced to overlook that :-}.

Welcome: send mail to poi-dev-subscribe@jakarta.apache.org. If you would prefer to write a C library as opposed to Java, then we can all still collaborate on fully documenting these formats (actually we donate all of our Excel documentation to the OpenOffice.org document that you mentioned). We've fully documented the OLE 2 Compound Document format (but corrections/clarifications/etc are always welcome).

The areas we most need help on are: Word format, Excel Formulas, Pivot Tables, dunno if Glen needs help on Graphing or not.

As for APL, I'm just a programmer, I get paid to write software (not POI, but other software), I strongly believe in *free* software but I really could care less about *which* kind of *free* software. All of the politics of GNU etc bore the living crap out of me to be honest. I want to write code, not argue pedantic issues and philosophical issues about the *meaning of free* and yada yada.. snore. Those kinds of discussions usually get boring so quickly for me.

POI is mostly about the intellectual challenge of cracking those suckers wide open. Secondly, it's a way for me to never have to use Windows again in a server situation!

As for Jody's confusion about *use cases* for a generic library. I really couldn't care less about *writing* a spreadsheet from a client perspective. I barely even know how to use Excel (shocking huh). But I often am required to interoperate with such software. Try developing a reporting system in Java without Excel interoperability coming up in discussion from the users. For the work I do professionally I plan to use the POI serializers for Cocoon. I'll create reports and publish in XML. Through configuration and maybe a stylesheet I'll answer the *business requirements* and get paid. Others will use the API. Next, I work on Lucene as well (Java search engine) -- try deploying a search engine in a Fortune 500 w/o Word and Excel search capability. Then try and get paid.. There's some use cases.

Look forward to seeing you on the poi-dev list and working with you!

OLE 2 CDF docs, posted 5 Mar 2002 at 12:38 UTC by acoliver » (Journeyer)

http://jakarta.apache.org/poi/poifs/fileformat.html

Re: OLE trademark?, posted 5 Mar 2002 at 16:24 UTC by julesh » (Master)

IANAL, but I would say that the name of OLE (which is an acronym for Object Linking and Embedding) is a descriptive, generic term, and as such could not be adequately defended as a trademark. Of course, that is unlikely to stop Microsoft ("where do you want to go today (tm)?") from attempting it.

No Insult intended, posted 6 Mar 2002 at 06:41 UTC by Jody » (Master)

There is certainly a place for code to read/generate xls and other MS formats outside spreadsheets. The perl modules and the POI project seem to handle that quite nicely for their respective development environments. I take issue with the creation of _another_ such library. There are few enough of us working on these things that duplication seems foolish.

Hi Jody, posted 6 Mar 2002 at 13:02 UTC by acoliver » (Journeyer)

I took no offense to your questions. I was just offended by the idea that there was NO way to do Excel on UNIX (when the author surely knew there is). I just thought you were confused about what the use case is for an Excel library outside of a spreadsheet GUI. I agree with you that a project for documenting the formats might be a bit out of line.

I think it would be nice if the OpenOffice.org and Gnome Office projects could collaborate on a plugin for these formats. Of course the reasons that may not ever happen are more political and legalistic than anything else, but *shrugs*.

As for POI we already collaborate with OpenOffice.org on their documentation of the Excel format (Daniel Rentz is a super nice guy) and furthermore we will certainly collaborate with whomever is willing on documenting the Word file format. And as you are aware the serializer (now part of Cocoon) we've developed reads the gnumeric tag language (and soon the generator will output in it), as a result POI developer Marc Johnson developed the Gnumeric XML schema and donated it to Gnumeric. *shrugs* in my view this lazy method of collaboration is the BEST way to develop opensource^H^H^H^H^H^H^H^H^H^H software.

Anyone who is interested in collaborating on a method of reading/writing any Office file format in Java is certainly welcome to join us! (http://jakarta.apache.org/poi)

I do agree that libole2 needs to be rewritten. It hurts my head :-).

Why collaboration is difficult, posted 7 Mar 2002 at 17:08 UTC by Jody » (Master)

Corollary #2 to Rule #3
'Never attribute to malice what can be explained by laziness'

I don't think the main impediment to increased reuse between the communities is political. As you mention, Daniel Rentz is quite amicable, and I'd hope the gnumeric folk have also been friendly. The trouble is that these really are different systems. The same problem arises when attempting to select a common xml format. File formats are at some level always going to be tied to their parent application's data model. Even when attempting to model similar behaviors, differences are inevitable. E.g., Gnumeric uses MERGE records from XL; OpenCalc uses the merged flag in the XF record. Gnumeric's xml uses shared expression tags; OpenCalc does not. Borders are handled very differently... the list goes on. Collaboration between projects is difficult; a master/slave relationship is more tenable.

I'll do what I can to work towards increasing the amount of overlap with OpenCalc (e.g. using their structured file format) but it is a slow process.

Clarification, posted 8 Mar 2002 at 00:08 UTC by acoliver » (Journeyer)

Hi Jody,

I don't think the main impediment to increased reuse between the communities is political. As you mention, Daniel Rentz is quite amicable,

What I was actually referring to was the GNU politics: GNU's stance that licenses cannot be mixed freely without the so-called viral clause coming into play, etc etc. It's often somewhat of a hindrance to collaboration. Anyhow, I absolutely do not want to debate that with anyone; I'm just mentioning that it is often an issue.

As you mention, Daniel Rentz is quite amicable, and I'd hope the gnumeric folk have also been friendly.

Oh yes, you all have been just fine, I didn't mean to imply otherwise. (BTW Marc Johnson would still like to be listed as having contributed the XML Schema on whatever is the relevant place).

Gnumeric uses MERGE records from XL; OpenCalc uses the merged flag in the XF record.

BTW AFAIK using the merged-cells record is correct as of BIFF8. Excel ignores the XF record's merged setting. I believe recent versions of OpenCalc fix this.
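
To make the record-based approach concrete, here is a minimal sketch of decoding a merged-cells record. It is written in Python for brevity and is not code from POI, Gnumeric or OpenCalc. The function name `parse_merged_cells` is made up. It assumes the usual BIFF record framing (a 2-byte record id and a 2-byte length, little-endian) and a BIFF8 MERGEDCELLS body (record id 0x00E5) consisting of a 16-bit count followed by first row, last row, first column and last column for each range, all 16-bit.

```python
import struct

MERGEDCELLS = 0x00E5  # assumed BIFF8 record id for merged-cell ranges


def parse_merged_cells(record):
    """Decode one BIFF record; return merged ranges, or [] for other records."""
    rec_id, length = struct.unpack_from("<HH", record, 0)
    if rec_id != MERGEDCELLS:
        return []
    body = record[4:4 + length]
    (count,) = struct.unpack_from("<H", body, 0)
    ranges = []
    for i in range(count):
        # Each range is four 16-bit values: row first/last, column first/last.
        ranges.append(struct.unpack_from("<4H", body, 2 + 8 * i))
    return ranges


# Hand-built record holding one range: rows 0-1, columns 0-2 (A1:C2).
body = struct.pack("<H4H", 1, 0, 1, 0, 2)
record = struct.pack("<HH", MERGEDCELLS, len(body)) + body
print(parse_merged_cells(record))
```

Under this model a reader collects merged ranges per sheet from these records and does not need to consult the XF record at all.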

I'll do what I can to work towards increasing the amount of overlap with OpenCalc (e.g. using their structured file format) but it is a slow process.

Great, I think this is for sure a better approach than creating another C project (aside from the obvious need to rewrite libole2 due to its love of hundreds of layers of #define expansions). I'd like to see those things componentized, but that's the long view.

As an aside:

The POI Serializer for Cocoon made its way into Cocoon. It reads the Gnumeric XML format (generated via gnumeric or a stylesheet) and outputs it in XLS (purely via Java). While I realize you'd have to rewrite it in C, you might want to take a look at the SAX-based approach that we and Cocoon have used. In my opinion this event-based *generator*/*serializer* model would simplify the effort you have to go through. (I do quietly monitor the gnumeric list but I've not seen much discussion about these things).
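
The event-based idea can be sketched in a few lines. The example below uses Python's standard `xml.sax` module and is not the POI or Cocoon code. The `<Sheet>`/`<Cell>` markup is a simplified, hypothetical stand-in for the Gnumeric XML format, and `CellHandler` is an invented name. The handler plays the role of the serializer: it receives parse events and hands each completed cell to a callback, which a real serializer would turn into XLS records.

```python
import xml.sax


class CellHandler(xml.sax.ContentHandler):
    """Turns SAX events into (row, col, text) callbacks, one per cell."""

    def __init__(self, emit):
        super().__init__()
        self.emit = emit
        self.pos = None  # (row, col) while inside a <Cell>, else None
        self.text = []

    def startElement(self, name, attrs):
        if name == "Cell":
            self.pos = (int(attrs.get("Row")), int(attrs.get("Col")))
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.pos is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "Cell" and self.pos is not None:
            self.emit(self.pos[0], self.pos[1], "".join(self.text))
            self.pos = None


# Hypothetical, simplified markup; not the real Gnumeric schema.
SHEET_XML = b"""<Sheet>
  <Cell Row="0" Col="0">Region</Cell>
  <Cell Row="0" Col="1">Sales</Cell>
  <Cell Row="1" Col="0">North</Cell>
</Sheet>"""

cells = []
xml.sax.parseString(SHEET_XML, CellHandler(lambda r, c, v: cells.append((r, c, v))))
print(cells)
```

Because cells are emitted as they are parsed, the whole document never has to be held in memory, which is the main attraction of the streaming model.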

In my opinion, both projects have their advantages and disadvantages. From a third-party point of view, I use gnumeric most of the time for my testing because it is lighter weight and more stable. OpenOffice saves BIFF8 files that are closer to Excel (as of StarOffice 5.2). The Gnumeric XML format is superior and more developed than OpenOffice's (the only issue I've really had is all of the style regions that are created for blank cells rather than just having a *default* style cover blank cells or normalizing these somehow).

Anyhow keep up the great work on Gnumeric and I hope to collaborate with you further in the future.

POI Java implementation vs C API, posted 18 Mar 2002 at 02:59 UTC by scottg » (Journeyer)

For those who wonder why you'd want to manipulate Excel data on unix: we have an internal site to which customers upload data that was originally in Excel but is converted by a VBScript into plain text so that our unix box can read through it and do the appropriate things with it. The data is performance metric data. So, sure, the customer edits the Excel file, but we want to take that file, automatically read and possibly manipulate it, and get rid of the VBScript at the same time. We could probably use some of the existing code that's out there, but I really haven't seen anything like a simple library API to do this (though I haven't looked all that hard either :)

I'd prefer a C API over a Java API any day to access and manipulate excel files in an automated fashion.

/s.
http://scottg.net

Re: Spreadsheet::WriteExcel, posted 21 Mar 2002 at 10:58 UTC by jmcnamara » (Journeyer)


This is a better link for Spreadsheet::WriteExcel.

C API, posted 24 Mar 2002 at 16:11 UTC by acoliver » (Journeyer)

I'd prefer a C API over a Java API any day to access and manipulate excel files in an automated fashion.

Great, come join us over at the POI project in producing good documentation for these file formats so that other APIs can grow from them. Or why not help out in librarizing the Gnumeric or OpenOffice.org filters?
