When the Library of Congress began to use computers in the 1960s, it devised
the LC MARC format, a simple record specification that still allows for
portable catalog entries today. The MARC 21 record format, coupled with
the z39.50 transport protocol, is used by libraries all over the US to swap
bibliographic data. But what about the Free Software world? With the
proliferation of free POSIX systems that run on cheap hardware, free library
software would seem to be a no-brainer for small libraries and special
collections. What's kept this from happening?
For those of you who groaned when you read the title, I assure you that I did
that solely so I wouldn't be tempted to make any more bad library puns in the
body of this article.
Seth Schoen has been scanning
an awful lot of books with his snazzy CueCat barcode reader. His scripts make
search requests via the Library of Congress
Web Catalog, storing the result in plain text files. Although novel, this
is hardly the state of the cataloging art.
Libraries have been using computers for cataloging since long before it was
generally practical to do so. A few widely-used standards have been published,
including the MARC 21
record format and the z39.50
searching and record retrieval protocol.
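For readers who have never looked inside one, here is a minimal Python sketch of the exchange form of a MARC record (the ISO 2709 layout that MARC 21 uses): a 24-byte leader, a directory of 12-byte entries, then the fields themselves. The helper names build_record and parse_record, the control number demo0001 and the sample field values are made up for illustration; a real catalog would lean on an established MARC library rather than on this.

```python
# Minimal sketch of the ISO 2709 layout used by MARC 21 exchange records.
# build_record/parse_record are illustrative helpers, not a real library API.

FIELD_END = b"\x1e"   # terminates the directory and every field
RECORD_END = b"\x1d"  # terminates the whole record
SUBFIELD = b"\x1f"    # introduces each subfield inside a data field


def build_record(fields):
    """Serialise [(tag, data_bytes), ...] into one record."""
    directory = b""
    body = b""
    for tag, data in fields:
        data += FIELD_END
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(data), len(body))
        body += data
    directory += FIELD_END
    base = 24 + len(directory)    # where the field data begins
    total = base + len(body) + 1  # +1 for the record terminator
    # Leader: record length in bytes 0-4, base address in bytes 12-16.
    leader = b"%05dnam a22%05d a 4500" % (total, base)
    return leader + directory + body + RECORD_END


def parse_record(raw):
    """Return the fields of one record as a list of tuples."""
    base = int(raw[12:17])
    directory = raw[24:base - 1]  # drop the directory's own terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # length includes the trailing field terminator, so leave it off.
        data = raw[base + start:base + start + length - 1]
        if tag < "010":
            # Control fields (001-009) carry no indicators or subfields.
            fields.append((tag, data.decode("utf-8")))
        else:
            indicators = data[:2].decode("ascii")
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in data[2:].split(SUBFIELD)[1:]
            ]
            fields.append((tag, indicators, subfields))
    return fields


sample = build_record([
    ("001", b"demo0001"),                               # hypothetical control number
    ("245", b"14" + SUBFIELD + b"aThe Myrkin Papers"),  # title statement
])
for parsed_field in parse_record(sample):
    print(parsed_field)
```

The point is only that the format is simple enough to read by hand: everything needed to find a field is in the leader and the directory.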
The great news is that there exists a Free (MIT license) z39.50 library and
toolset called YAZ. It's planned
that the YAZ library will be integrated into Mozilla's RDF
handling, and it has already been used to create an Apache module. I
have lost the link, but I recently stumbled across a z39.50 to X.500 gateway.
Although experimental, it prompted me to think how useful it would be to create
a similar gateway using LDAP.
What I envision is a system for small collections that uses a local LDAP server
to maintain a database of MARC records. When the librarian scans in a book, it
is looked up there first. If the record is found, it appears on the screen in
some form. If the record is not found, it is fetched from the Library of
Congress via z39.50 and then displayed with an option to import the record into
the local database.
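The workflow above can be sketched in a few lines of Python. This is only an outline under stated assumptions: a plain dict stands in for the local LDAP server, fake_remote stands in for a Z39.50 query to the Library of Congress, and the function names and barcode values are invented for the example.

```python
def lookup(barcode, local_db, fetch_remote):
    """Return (record, source) for a scanned barcode.

    local_db is any dict-like store; fetch_remote is a callable that
    returns a record or None. Both are stand-ins, not real clients.
    """
    record = local_db.get(barcode)
    if record is not None:
        return record, "local"
    record = fetch_remote(barcode)
    if record is None:
        return None, "not found"
    return record, "remote"


def import_record(barcode, record, local_db):
    """Store a remotely fetched record once the librarian accepts it."""
    local_db[barcode] = record


# Made-up data standing in for the two catalogs.
local_db = {"0000000000001": {"title": "Already Catalogued"}}


def fake_remote(barcode):
    remote_catalog = {"0000000000002": {"title": "The Myrkin Papers"}}
    return remote_catalog.get(barcode)


record, source = lookup("0000000000002", local_db, fake_remote)
print(source, record)
if source == "remote":
    # In a real interface this would be the librarian's "import" button.
    import_record("0000000000002", record, local_db)
print(lookup("0000000000002", local_db, fake_remote)[1])
```

Keeping the import as a separate step mirrors the article's idea that the librarian chooses whether a fetched record enters the local database.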
There are a few steps that need to be completed for this to happen:
- A suitable OpenLDAP schema for MARC records. It should also allow searches
to be specific to a particular library, or to span a number of collections
(e.g. "Give me all books written by Martin Gardner in libraries in Alameda
County"). The LDAP forwarding and slurping capabilities make the details of
this fairly easy, although there are messy details involved in handling things
like inter-library loan.
- Information needs to be stored on actual physical copies of library
resources. One needs to be able to find out that (for example) the library has
two copies of The Myrkin Papers: one is on the stacks, and the
other is checked out. Perhaps it could even tell you that the checked out copy
is in poor condition, and will need to be re-bound when it is returned.
- Coming up with a user interface for the system. Web pages are fine for
undergraduate researchers, but a librarian will need something with a bit more
power. Many are currently used to vt220s with attached barcode readers
and printers, but there will probably need to be some GTK or Qt apps
for catalog maintenance.
- Librarians have to deal with internationalization regularly, even with
small collections. Being able to enter a record for a monograph with a Farsi
title but a Greek author is key.
- Borrower records. Libraries need to keep track of people, as well.
There are more, but this is what comes to the front of my mind. The only thing
I will add to this is that most catalogs fail to store the two most useful
pieces of information about a library book: the books immediately to the left
and right of it, and the color and thickness of the spine. It may not help a
computer find a book, but it sure helps a human.
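To make the holdings requirement from the list above concrete, here is a small Python sketch of per-copy records, including the spine details mentioned in the last paragraph. All class names, field names, barcodes and values are assumptions chosen for illustration; nothing here comes from an existing schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ON_SHELF = "on shelf"
    CHECKED_OUT = "checked out"


@dataclass
class Copy:
    """One physical copy of a title (all fields are illustrative)."""
    barcode: str
    status: Status = Status.ON_SHELF
    condition: str = "good"
    rebind_on_return: bool = False
    spine_color: str = ""
    spine_mm: int = 0


@dataclass
class Holding:
    """A title plus every physical copy the library owns."""
    title: str
    copies: list = field(default_factory=list)

    def on_shelf(self):
        return [c for c in self.copies if c.status is Status.ON_SHELF]

    def needs_rebinding(self):
        return [c for c in self.copies if c.rebind_on_return]


myrkin = Holding("The Myrkin Papers", [
    Copy("C-0001", spine_color="green", spine_mm=22),
    Copy("C-0002", status=Status.CHECKED_OUT, condition="poor",
         rebind_on_return=True, spine_color="green", spine_mm=22),
])

print([c.barcode for c in myrkin.on_shelf()])
print([c.barcode for c in myrkin.needs_rebinding()])
```

The split between the title and its copies is the important part: the bibliographic record is shared, while status and condition belong to each physical item.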
I'm sure that we have a number of people here who have done a lot of database
and directory development over the years. Perhaps I'm wrong, and there are
already a large number of FreeBSD and Debian boxes hidden away in major
libraries around the world, running the show without stealing it. After all,
as Cynbe Ru Taren once said, "In software as elsewhere, good engineering is
whatever gets the job done without calling attention to itself."
Lots to say about this but basically, you're right: it's a simple
failing of the library community not to have produced any decent
packages for what you want. And no, there aren't a bunch of mystery
boxen running openldap. That said, there are lots of pieces in place
and it's important to keep the issues straight.
(Btw don't miss the oss4lib
projects page. Specific highlights are Koha, a public library-scaled catalog, osdls, which has made a
nifty start toward freeing up some useful cataloging tools, and pybliographer, a
solid reference management tool in python that could be extended to pull
values out of marc records usefully. More are mentioned below.)
In rough order of your mention:
- afaik all the docs on moz+z39.50+rdf etc are at least a year old, so
it might be that the project's kind of stale. Haven't checked
the code, though... hopefully Sebastian or others can correct my
assumption.
- indexdata released a YAZ-based apache module as ZAP, which is likely the same
thing you pointed to and a better, newer link than the usgs one (though
the domain isn't resolving right now for some reason). I tested the
.rpm distro and it hammered on the Yale catalog right away. Also they
linked YAZ into PHP. Great
contributions, these.
- You could transform MARC into an ldap schema, but it would likely be
easier to use MARC.pm as a translator and store the stuff in an
xml-friendly engine. I've not used it, but all reports are that MARC.pm
can also handle loads of data nicely.
- The US Library of Congress is not the be-all and end-all
of collections. It has millions of volumes, but there are lots and lots
it does not have. You might want to hit a variety
of Z39.50 servers; they'll all give you different results.
- If you're keen on writing a new schema for MARC, it might be
easier to do in RDF, and somebody's probably already done it (sorry, no
links for ya there :( ). In any case the RDF libraries (redfoot,
redland, 4suite, etc.) might require less coding than ldap would,
although time savings from being able to ldappily slurp up records might
make this moot.
- Your point about using ldap to distinguish regional collections
is right on; that would be great but we'd need the registry of libraries
first. Also you might be interested in reading about some of the
international work being done on identifying
localized name forms. If they push their model of
national/regional name authority control into an ldap world it would be
remarkably useful to all of us. Actually you might want to click up to
leaf through all the pages from the recent Bibliographic
Control conference at LC. All of the talks will have video streams
up soon. Don't miss Priscilla Caplan's excellent overview of when a new
descriptive metadata schema works and what the immediate issues are
trending to be for all the nascent
next-generation schemas. This was an excellent conference filled with
true visionaries in the history, present, and future of MARC, AACR2,
Z39.50, etc (I, maybe the youngest and most unproven soul in the room,
was lucky enough to attend probably because of jake). It will be interesting to
see what will come of the recommended output from the group to LC.
- Don't mix up the issues of keeping a bunch of metadata records
and keeping a library catalog going. A library's record handling starts
with acquisitions, i.e. purchasing, and although there are EDI sets
for speeding those bits up none of our major vendors use free software
to do this afaik. Thus when an item is received, any descriptive
layers, whether they come from LC or OCLC or RLIN or are created anew,
are tied to a purchasing record. Then circulation modules have to tie back to
these so we know what to charge users for lost items and so forth.
Yeah, yeah, all this is obviously sorta like any now-standard ecommerce
framework but if you really want to help build something that big
there's lots of pieces to it (but don't let that stop you! :). If all
you or Seth really want, though, is a good index of your own personal
collection, or that of a group of friends, it gets easier really fast. :)
Fyi also there's a NISO
Circulation protocol under development and an ISO ILL protocol
that could be trimmed by 75% and re-encoded in xml to simplify parsing.
- As for "failing to store what's on the right and left of a book"
there's more there than you think. Catalogers use a 100-year-old system
named for Cutter (who started it, natch) to assign an ordered but
arbitrary number to identify shelf position based on using alphabetic
order of a few characters from certain keywords (usually title I think).
This is essentially an ugly human-powered set-value-table hash for
uniquely locating items between others -- imagine needing
/etc/rc.d/rc4.d/S99X288357454637firewall and you'll get the point. Most
library catalogs where this system is used are searchable by a
combination of the subject classification (general location) and Cutter
code (shelf position). And a handful of libraries track things like
color, too.
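The shelf ordering described in the last item can be sketched as a sort key in Python. This is a deliberately simplified reading of call numbers, assuming a form like "QA95 .G37" (class letters, class number, one Cutter); the regular expression, the function name and the call numbers are made up for illustration. The one detail worth showing is that Cutter digits file as a decimal fraction, so G37 sits before G4.

```python
import re

# Simplified pattern: class letters, class number, then ".<letter><digits>".
CALL_NUMBER = re.compile(r"^([A-Z]{1,3})(\d+(?:\.\d+)?)\s*\.([A-Z])(\d+)$")


def shelf_key(call_number):
    """Return a tuple that sorts call numbers into shelf order."""
    match = CALL_NUMBER.match(call_number)
    if match is None:
        raise ValueError("unsupported call number: %r" % call_number)
    letters, number, cutter_letter, cutter_digits = match.groups()
    # The class number compares as a number (9 before 76 before 95).
    # The Cutter digits compare as text, which matches decimal-fraction
    # order: "37" sorts before "4" just as .37 is less than .4.
    return (letters, float(number), cutter_letter, cutter_digits)


shelf = ["QA95 .G4", "QA76 .C5", "QA95 .G37", "Q175 .A1", "QA9 .B2"]
for item in sorted(shelf, key=shelf_key):
    print(item)
```

With a key like this, the books to the left and right of any item are simply its neighbours in the sorted list.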
In general things like LDAP might be best used to better manage all the
personal and corporate name authority pieces but maybe not much more.
Why else are all the ecommerce bigwigs building a corporate id
registry? It's pretty much the same need. Separating out that data,
and making it free would make the catalog records easier to assemble and
more accessible to all. We could do the same for book metadata. It's
already being done for some music (freedb) and films (imdb) and it's
what we're doing for journals (jake) even though we've
got a lot of work left to prove our point.
Ok, so when I say uppity librarian I really mean it. :) But this kind
of discussion comes up occasionally (1 2 3 4)
and it's important imho to clarify perceptions. Hope this hasn't been
too overbearing a response... it's great to see folks asking these
questions. :p
- LDAP is good for many things; being a database isn't one of them.
The best uses of ldap treat it as a write-once read-many kind of
device. If there is any potential for contention in write access to
the data, you are screwed with ldap. There is no concept of
locking in the protocol.
- LDAP makes a pretty good searching interface into a database,
but in and of itself it's a poor choice as the database.
- Booker C. Bense
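The write-contention problem raised in the comment above is the classic lost update. The Python sketch below uses a plain in-memory class as a stand-in for a store that offers only read and write with no locking; it is not an LDAP client, and the class name, key name and scenario are invented for illustration.

```python
class PlainStore:
    """Stand-in for a store with bare read and write, and no locks."""

    def __init__(self):
        self.entries = {}

    def read(self, key):
        return self.entries[key]

    def write(self, key, value):
        self.entries[key] = value


store = PlainStore()
store.write("copies_available", 2)

# Two circulation desks each check out one copy at about the same time.
# Both read before either writes, so each works from the same old value.
desk_a_sees = store.read("copies_available")
desk_b_sees = store.read("copies_available")
store.write("copies_available", desk_a_sees - 1)
store.write("copies_available", desk_b_sees - 1)

# Two copies left the building, but the store recorded only one checkout.
print(store.read("copies_available"))
```

A database with transactions or row locks would make the second desk wait or retry; a store with neither leaves that job to every client.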
After having worked in this field for almost five years, I can only say
it's a big job. Implementing basic Z39.50 capabilities is pretty simple,
but for full v3 capabilities with all the extended services etc., we're
talking man-years.
But a lot of this already exists in the public domain. You have the
protocol stack (eg. YAZ which you mention, but also the DBVOSI II, loads
of publicly available BER systems (java and C) and whatnot).
Other, more complete systems were developed under the EU projects SOCKER,
ONE and ONE2. The software developed under these projects should be
partially publicly available, and even more importantly, so should the
Z39.50 profiles for interoperability etc.
Of course, there is little chance of finding a huge honking tarball
with the code, but with enough hassle you should be able to get
some of the source code. Some of the projects even included Java Z39.50
protocol stacks (and partial ASN.1 compilers) and clients.
ONE2 should still be in progress, and I know it includes a big C++
api with modular backends for db systems, and support for some of the
evil Z39.50 features like the extended services, AccessRequests and
whatnot. So if the EU projects still have to disclose the source outside
the partnership, search for it.
Otherwise, I can recommend going to a ZIG
meeting next time there is one in your vicinity; you'll find a lot of the
zig people are very open-minded about open source stuff etc.
uh, south park.... gotta go...