Shared Functions: A Replacement for Shared Libraries
Posted 9 Mar 2000 at 08:28 UTC by aaronl
I would like to propose an idea I had recently that could serve as a
replacement for shared libraries. Since it would break compatibility and
would take a lot of work to implement, I don't believe that it should
be used in an existing operating system. Consider this essay as my view
of what shared libraries should have been.
From a casual perspective, shared libraries seem like a good idea, and
often they are. For example, when my program is written to be used with a
widget set and a C library, it makes sense to link those dynamically.
However, there are other situations where the question of which libraries
to depend on becomes harder to decide. If someone writes a library
of quick utility functions, should you copy the few functions you want,
or link against the library and require users to download, possibly
compile, and install the shared library? This can really get out of hand
when using several rare libraries. Copying/pasting functions will result
in code being duplicated across applications on the system, and linking
to many uncommon shared libraries can make people who want to use your
application go through extra steps to get it working (unless it is a
Debian package :) ). Let's face it, in this age many of the people using
open source programs are not hackers, and even if they are advanced
programmers that is no reason to make their lives harder.
First, let us return to the example of the utility function library.
This could be a library several megabytes large, with every C
utility function you could ever imagine. Just because you want to use
several of these functions doesn't mean that a user should be forced
into installing a large shared library. What if shared libraries were
distributed at the function level? A program could depend on certain
functions rather than libraries. Combine this with a good software
dependency management system like APT, and you cut down a lot of the
cruft that must be installed. In this example, only the functions that
the program actually used would be installed, saving a lot of space and
network bandwidth. These functions would be shared across applications
through some clever usage of namespaces, perhaps as a unique prefix
prepended to each function defining what "collection" it belongs to.
Some libraries like GTK+ and OpenGL already employ this namespace
collision avoidance scheme.
The real beauty of this scheme is that no code ever has to be duplicated
to avoid inconvenient dependencies. If a program needs a function, that
function will be grabbed off a server instead of a complete shared
library.
If shared libraries are a pain to download and install, how can
individual functions possibly make it easier? They would not,
_unless_ the system was designed to accommodate this using a smart
package management system. Since this essay suggests standardizing on a
new library format, it is not taking it much further to suggest
standardizing on a package manager :). Using APT would mean that the
distribution center for the libraries would NOT have to be standardized,
since a user could just add the distribution center for a particular
application to sources.list and grab the application. Dependencies would
be handled automatically, searching for the latest versions of all of
the libraries containing the functions it depends on in all of the
distribution sources. Having an official source for getting functions
would not be a good idea in the free software world, but people could
set up archives similar to metalab which would amass large volumes of
functions and would have APT package list files. In this way any
functions that a program depends on could be downloaded from one of the
few megaarchives listed in APT's configuration file.
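As a rough sketch of the APT side, this is what the sources.list entries might look like, with one per-application distribution center and one function "megaarchive"; all hostnames and suite names below are invented for illustration:

```
# Hypothetical /etc/apt/sources.list entries (hostnames invented):
deb http://apt.example-app.org stable main            # one application's own center
deb http://functions.metalab.example.org stable main  # a function megaarchive
```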
Of course, APT can download and install shared libraries automatically
now. But the problem is that people are wary of adding dependencies
because not many people use automated dependency systems like this. RPM
will tell you that you have unmet dependencies, but it is up to you to
find the packages. Since my idea is a complete fantasy anyway, I can add
a standardized package distribution system to the list of other dreams
it depends on without making it any less accessible :).
Note that when I talk about APT I am not trying to say that it would be
the only eligible program for the task. Personally, I think it's not up
to the task I'm describing, but I don't know of any similar programs. So
let's not start a war.
I'd be interested in any comments that people have. I am essentially a
newcomer to the Unix/Linux world and this idea may be dumb and boring
to read about. However, I thought it was worth sharing. Well, it seemed
like a good idea at the time ;-).
Not so simple..., posted 9 Mar 2000 at 09:32 UTC by lolo (Journeyer)
Well, in fact the suggestion you are making turns out to be very
complex to implement. Here are a few technical objections to it:
- For the C language a breakup of libraries at the function level may
seem appropriate, but with C++ the problem is a little bit more
difficult. Shall we split the libraries at the function level, or at
the C++ class level? What about the functions that are defined
implicitly, like constructors/destructors? Or the constructor calls for
static objects that are invoked at program startup...
- Even without considering languages other than C, I think that the
breakup at the function level may not be appropriate. How do you handle
global variables in this scheme? Also, I think that there would be some
complex dependency issues inside libraries themselves, where functions
depend on each other.
- Who would be responsible for specifying the dependencies on the
various functions your program needs? This is a complex task compared
to just specifying the libraries you depend on.
So IMHO the change you are proposing can be summarized as replacing a
few macro-dependencies on a big bag of functionality (a library) with a
lot of micro-dependencies on smaller entities that are themselves
inter-dependent. You are introducing more complexity into the system.
The initial problem you were trying to address with this proposal was
"the complexity of dependency management for end-users" (those
installing the software). And the solution you proposed in the end
("shared functions" and relying on a tool) has a flaw: it has
introduced more complexity into the system, and has thereby made it more
failure-prone (is that English? :-).
A possible solution to the initial problem is the generalization of
the use of tools like Debian's APT, just as you said. When using this
kind of tool, the complexity is handled by the tool and the people
creating installable packages. That way the interface offered to
end-users (those installing the software) is kept reasonably simple.
It could be made even simpler by using expert-system technology
to translate statements like:
- My name is John Q. Random.
- I'm a teacher (math).
- I have an interest in electronics.
- I like playing chess and go.
Into:
- Create a login for John Q. Random.
- Install a set of educational software (a gradebook, math software,
...)
- Install some board games (and provide the user info on how to play
such games over the internet).
- Install tools for drawing and simulating electronic circuits.
Hmm, well that's all for now.
What is a library? Basically, it is a collection of object files
(.o files) that are bundled in a single package. Although it would be
possible to distribute the object files separately, I think that it
would be difficult to split these at the function level as you are
proposing. In many cases, the functions contained in an object file must
be kept together because they have cross-dependencies and they also
depend on static functions that are not exported. So you cannot split
these object files unless you decide to make all internal functions and
data structures visible to the "outside", which is usually not a good
choice because that goes against clean APIs and it makes it impossible
to change the internals of the library without breaking the old
applications.
So a more realistic version of your proposal would be to distribute
object files separately, not functions.
That would be possible, but I am not sure that you would really gain
anything with that. You explain that the main advantage of your proposal
would be to save disk space by only installing the functions that are
needed by the applications, instead of installing the whole libraries.
But I think that you forgot one thing: in order to make it possible to
distribute the parts separately, you also have to make sure that you get
all the needed parts if there are some cross-dependencies between them.
Shared libraries are a collection of object files; distributing them in
a single package ensures that all cross-dependencies between these
object files are satisfied. If you distribute the object files
separately, then you must have a way of tracking down the dependencies,
so that a mechanism similar to APT can get all the required files. This
means that the object files (or some separate files) would have to
contain not only the names of the external symbols used by the code, but
also some version numbers and some other information to make the
automatic retrieval easier. This takes some extra space, and if you end
up needing all the files that were in the original library, you would
have consumed more disk space than if you had installed the whole
library as a single package.
Another problem is that some mechanisms like APT are fine as long as
your computer has a direct connection to the Internet, but are painful
if you are not connected. If you have to transfer all files on floppy
disks, resolving the dependencies for some packages can be a nightmare.
For example, if you get a Linux distribution on CD-ROM, you can install
everything and get a working system. But if you download and install
some new package later and discover (only after having transferred the
files) that it requires updates in other packages, then you will usually
prefer downloading two or three updated libraries to twenty or thirty
object files.
Suppose you have two functions that depend on each other, directly or
through some other path (and there are a lot of them in a large
library); you need to ensure that compatible versions of the pair are
installed. Considering that you would have to check every possible
dependence path between two functions, it starts to get difficult
quickly.
For instance, take a current library with 100 functions. Each function
has 98! (give or take a few) ways in which it could depend on the
others. So you have almost 100! possible dependencies to check.
In a system as complex as a current Linux system, there wouldn't be
enough disk space to store the dependency information. Not to mention
that it would take quite a bit of work to figure them out ;-)
Of course you can simplify this by defining groups of functions and only
caring about inter-group dependencies, but then it's the same as it is
now, isn't it ;-)
I think that your idea has a couple of semi-related points.
The first is to advocate the use of tools such as the Debian dependency
system to eliminate the need for a user to worry about which shared code
a given application uses. One of the problems with the dominant
closed-source desktop software is its tendency for new applications to
distribute fresh, incompatible versions of shared code, breaking
previously installed applications: the so-called DLL Hell.
I would not disagree with many of the points raised above. In a
networked environment, where the distribution point can be trusted, it
is very nice to be able to give one command and get the software brought
up to the current version. For users who are on slow links, or have to
use sneakernet for upgrades, the missing link is a tool that could walk
the dependency chain and get all the required updates in one swoop. This
might consist of a downloadable database of the current status of all
packages recognised by the distributor, and a local query program that
would give the user a list of what supporting packages would have to be
downloaded/installed from the original CD in order to install a new
package.
The second point I think you are trying to make concerns the planning of
the structure for shared code. One plan would be to make all-singing,
all-dancing lib-everything packages, on the model of the C library. This
plan has the advantage that one download installs many functions. The
disadvantages include a large download for any change, and the
possibility of a new version breaking something else. Your proposal
seems to be asking for many small libraries, each basically containing
only one or two related functions. This has the advantage that only the
needed code has to be downloaded and stored, but requires many more
small files, and more dependency checks and tools.
I suspect that the real issue is for planners to resist the temptation
to throw every idea that they have into a library, but instead to plan
the uses that the library will serve. If there are unrelated functions
in a library that you are writing, they should probably be moved into a
separate package - of course, one does not have that choice if the
library is already used by another program.
The general rule of life applies - Keep It Simple. I suspect that in
many cases that means starting a new library rather than adding
unrelated functionality to an existing one. If that means creating
libraries that contain only one function, then that is a valid choice.
As your program requires more shared libraries, it takes longer to
start up. For each library the program requires, the dynamic linker has
to find the file where it is stored, possibly follow a few symbolic
links and finally load the library into the program's address space.
With some of the programs in GNOME, this is already starting to
become a problem. Increasing the number of libraries that need loading
this much (a `shared function' is pretty much just a single function
shared library) will definitely have an effect on program startup
time.
Also, splitting up libraries can have negative consequences --
currently libraries are internally consistent, version-wise, due to the
way they are installed. By splitting things up like this, there is the
possibility of having different, inconsistent versions of parts of the
library, unless the system is designed very well.
Dependencies., posted 9 Mar 2000 at 14:06 UTC by caolan (Master)
I have to say I'm in favour of the goals.
One example of the sort of thing that a layout like this would sort out
is something I hit with libwmf, where I want to use libxpm to read an
xpm from file into xpmdata. libxpm depends on X, which of course
requires the usual 3 to 5 libraries. The configure script bulges at the
seams to find libxpm (which could be anywhere) along with everything
that X requires. A bit of overkill to read in an xpm file. The xpm
function in libxpm being used does not need X at all.
On related topics, it's a real pain in the ass to keep dependencies
straight in other projects, especially optional dependencies. Let's take
wv: it needs nothing, but it would like to have an iconv implementation
to convert charsets; if it's not there then it works fine but can only
output native utf-8. Fine, this is the way I want it, but if I want
others to link against wv I have to also install a script along the
lines of the gnome-config thing from make install, so that a configure
script searching for wv knows whether or not it has to link against
libiconv (not to mention the countless other optional things that libwv
might want: libwmf and its optional dependencies, libMagick, etc. etc.).
It's not ideal by any measure. Now I suspect that libtool can handle
some of the workload of working out dependencies for me.
But I haven't tracked down a simple example of what I have in mind. I'd
really like something like this for me. A configure line like...
AC_SUPER_CHECK_LIB(wv, main),
which runs off and finds wv, works out its dependencies for me and hands
me back that list in LD_FLAGS or whatever (aside: I want my complete
list of libraries to be sorted for me so that duplicates are removed).
While I am at it, I want the autoconf macros for CHECK_LIB to do more
for me: look in the X library place for stuff as well as /usr and
/usr/local. Surprising the amount of stuff that ends up there! I also
want it to search for ordinary includes in the X location.
On a related idea, wouldn't it be nice if every program was basically a
tiny executable with all of its functions in a library? Say, for
instance, even ls; libls would be nice. Get all the nice dir parsing for
free. Get better fine control over its functioning than you could get
with piping. I have been thinking recently about pulling down the source
to some of the basic gnu utilities, which are very stable and pretty
much unchanged for years, librarising the whole lot of them, and seeing
what kind of speed difference it makes, with an eye to making the
majority of the internal functions public and documented. I'm forever
writing code to strip out parts of a pathname, for instance; there are a
host of other simple examples which would be found floating around in
common binaries.
This is just some freeform thinking; no practical problems allowed to
rain on my parade here. (Programs with 100 dependencies come to mind --
slow. No other system would have these libraries, and a program being
compiled under Solaris would end up requiring about half the gnu project
to print "hello world". But still it appeals to me.)
C.
- If there's static data in the .so, even if it's "proper" singleton
data with a mutex on it and everything, you will need initialization and
usage information to get things loading at the right time. Look at
global constructors in .so's: it's not trivial to decide at which point
/ in which order these sorts of things get run, and you'll need to embed
that information in your APT workalike.
- C++ on linux lacks a "decoupleable" binary object model anyway. If
your parent class changes, you are stuck needing to re-layout the
children. So, frequently, "whole .so" upgrades of C++ programs are, at
the moment, necessary if you want the thing to run, because some
superclass somewhere changed slightly.
On a related idea wouldn't it be nice if every program was basically a
tiny executable with all of its functions in a library. Say for instance
even ls, libls would be nice. Get all the nice dir parsing for free. Get
better fine control over the functioning of it that you could get with
piping.
... and end up with some really odd licensing issues, quite probably.
Yes, that aside, it would be lovely if there were a slightly more sane
interface to most common functions than the "everything is a string"
approach that the shell command line provides.
I'd go a step further, and eliminate the need for the shell altogether
by using something like gdb to call the functions interactively.
I'm not sure that C makes the best language to call these functions in,
though. Much of the power of the shell is from being able to meddle with
strings at low cost, and I guess some amount of that would still be
necessary. So, perhaps you'd need convenient string operators, which
implies dynamic memory allocation - ideally with reference counting or
some kind of GC.
You'd ideally want some syntax (like XML, maybe) that would let you
create structures of objects at the command line...
Stop me when you realise this is another poorly disguised ad for a Lisp
listener.
For years I've been interested in the idea of reducing program size; I
don't think finer granularity is the answer.
One answer might be an object-code optimiser that can inline functions
from libraries, reorder functions in the code to improve demand paging
performance, and optimise function calls. There's a lot an optimiser can
do at this level. For example, some systems have calling conventions
that require registers to be saved before entering a library function,
but if you know the library function doesn't use the registers, you can
remove that code and get a speedup.
This whole area is something I know Bjarne Stroustrup was hoping would
develop for C++, because a C++ compiler would really benefit from a
database of information about classes, methods and functions in an
application. It's crazy that the source to a function has to be in a
header file in order for it to be inlined.
I think Graydon mentioned to me he had seen a first attempt at a binary
code optimiser at a Linux conference a year or two ago.
You could run such a program on a network server, so that when you run a
program for the first time, if necessary, it gets optimised for you.
I think there is a difficulty with C in that as we build larger and
larger software, we start to need a "library of libraries" concept. A
finer granularity of object libraries only makes sense to me if you have
tools and infrastructure to manage it. Right now, linking against over
100 libraries will for one thing run into limitations in software on
some platforms (e.g. a command line length limit of 512 bytes if you're
typing into a Unix tty driver on many systems!) but, worse, it gets
impossible for the humans to manage. Today, humans can compile Unix
programs. Well, OK, I'm subhuman and I can do it. If you lose that
ability, do you risk damaging the growth of Unix?
If file size and speed are important to you, work on a binary code
optimiser. Or an optimizer, for US users :-)
Shared libs, posted 10 Mar 2000 at 01:51 UTC by djm (Master)
I am not sure that the system you propose would give much of a net gain
over shared libraries. To implement such a scheme would be complex and,
as you mentioned, incompatible with what we are currently using.
The good thing about shared libs is that the cost (storage & memory) is
amortised over the whole system. We could achieve a good deal of the
advantages of shared functions simply by making the shared libs more
granular.
Neat idea..., posted 10 Mar 2000 at 02:16 UTC by DaveD (Observer)
Plenty of technical hurdles to overcome. Probably even more developer mindshare hurdles.
A better idea might be shared or distributed objects instead of functions. A function is pretty useless
out-of-context. But being able to snag an object
(e.g., attenuation response spectrum)
on demand, without grabbing an entire library for, say, non-linear seismic site response, could be
mighty handy. All the code relevant to the spectrum is self-contained within the class object file,
which is smaller than an equivalent library.
Check out NetSolve for a working
implementation of something that may be very similar to the topic of discussion.
This rather reminds me of the situation in Java, where people have a
choice of distributing and linking against .class files, each containing
the bytecode for one class, or .jar files, containing a bundle of
classes. Generally people find it simpler to distribute jarfiles, but
not always.
On the other hand, the startup time to dynamically link all the little
fiddly bits is one of the annoyances of working in Java, so...
FreeBSD ports, posted 29 Mar 2000 at 18:00 UTC by imp (Master)
You should consider a slightly different approach to this problem.
Your main objection to shared libraries seems to be that they are hard
to download. I will grant this is true for programs built on gtk,
especially multi-media ones that need multiple other support libraries
for sound, graphics, video, etc.
However, you should be aware that the FreeBSD ports system makes this
almost painless. All the intra-library dependencies are encoded into
the ports system, so when you want the latest cool video player that
will also display pictures, you type make all install in the right ports
directory, and all dependencies are downloaded, built and installed.
This certainly is a much less radical solution than the shared functions
that you are talking about. There is also much less of a chance for
name space collision, not to mention the global constructor problem or
the shared/non-shared data problems that others have talked about.