30 Dec 2010 amits   » (Journeyer)

Idea: Faster Metadata Downloads With Yum and Git

The presto plugin for yum has worked great for me so far.  It's been very useful, not for the lack of download limits, but for the time saved in getting the bits downloaded.  The time saved is significant if the bandwidth is not too good (it never is).

However, I've observed in some cases the presto metadata is larger than the actual package size in some cases -- e.g., a font.  If a font package, say 21KB in size, has a deltarpm of 3KB in size, it results in a savings of 18KB of downloads.  This is a very impressive 85% of savings.  However, the presto metadata itself could be more than 400KB, nullifying the advantage of the drpm.  We're effectively downloading, in this corner case, 418KB instead of 21KB.  That is 19 times of what of the actual package size.

So here's an idea: why not let git handle the metadata for us?  The metadata is a text (or sqlite) file that lists package names, their dependencies, version numbers and so on.  Since text can be very easily handled by git, it should be a breeze fetching metadata updates from a git server.  At install-time (or upgrade-time), the metadata git repository for a particular Fedora version can be cloned, and on each update, all that's necessary for yum to do is invoke 'git pull' and it gets all the latest metadata.  Downloads: a few KB each day instead of a few MBs.

The advantages are numerous:

  • Saves server bandwidth
  • Uses very less server resources when using the git protocol
  • Scales really well
  • Compresses really well
  • Makes yum faster for users
    • I think this is the biggest win -- not having to wait ages for a 'yum search' to finish everyday has to get anyone interested.  Makes old-time Debian users like me very happy.
There are some challenges to be considered as well:
  • Should the yum metadata be served by just one canonical git server, while the packages get served by mirrors?  Not each mirror may have the git protocol enabled nor can the Fedora project ask each mirror to configure git on the server.
    • Doing this can result in slow mirrors not able to service package download requests for the latest metadata
    • This can be mitigated by using git over http over the server
  • The metadata can keep growing
    • This can be mitigated by having a separate git repository for the metadata belonging to each release.  Multiple git repos can be set up easily for extra repositories (e.g., for external repos or for multiple version repos while doing an upgrade).
  • The mirror list has to be updated to also include git repositories that can be worked on with 'git remote'.
I've filed an RFE for this feature.  For someone looking for a weekend hack for yum in python, this should be a good opportunity to jump right in!  If you intend to take this up, get in touch with the developers, make sure no one else is working on this yet (or collaborate with others) and update the details on the Fedora Feature Page.

Syndicated 2010-12-30 20:58:00 (Updated 2010-12-30 20:58:48) from Amit Shah

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!