johnw is currently certified at Master level.

Name: John Wiegley
Member since: 2000-09-20 05:20:40
Last Login: 2007-11-01 11:14:00


Homepage: http://www.newartisans.com/

Notes:

My journal is now kept here.

Projects

Recent blog entries by johnw

Syndication: RSS 2.0

15 May 2008 »

Using Git as a versioned data store in Python

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><html><head><meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org"><title></title></head><body>

Git has sometimes been described as a versioning file-system which happens to support the underlying notions of version control. And while most people do simply use Git as a version control system, it remains true that it can be used for other tasks as well.

For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need. It’s fast, efficient, and offers a large array of command-line tools for examining and mutating the resulting data store.

To support this kind of usage – for the upcoming purpose of maintaining issue tracking data in a Git repository – I’ve created a Python class that wraps Git as a basic shelve object. Here is how you normally use the standard shelve module:

import shelve

key = 'greeting'  # any string can serve as a key
data = shelve.open('data.db')

# data.db may or may not have existed on disk before now. If not,
# we're manipulating an empty dictionary. If so, we can examine or
# modify the previous run's state data. In both cases, the database
# is manipulated like a standard Python dictionary.

data[key] = "Hello, world!"
data.sync() # Write out changes to the dictionary

del data[key]
data.close() # Close and clean up, sync'ing only if necessary

This provides the simplest kind of database, without any query language or notion of whether previous state did or did not exist. Both of those are services you’d have to layer on top of the shelve object if you wanted them.
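As a rough illustration of layering such a service on top, the sketch below records whether a shelf had already been used by an earlier run. It assumes Python 3, and both the helper name `open_with_history` and the marker key are invented for this example; neither is part of the shelve module.

```python
import os
import shelve
import tempfile

MARKER = "__initialized__"  # hypothetical bookkeeping key, not part of shelve


def open_with_history(path):
    """Open a shelf and report whether an earlier run had already used it."""
    db = shelve.open(path)
    existed = MARKER in db
    db[MARKER] = True
    db.sync()
    return db, existed


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "data.db")

# First run: nothing on disk yet, so no previous state is reported.
db, existed_first = open_with_history(path)
db["greeting"] = "Hello, world!"
db.close()

# Second run: the marker written earlier tells us state was left behind.
db, existed_second = open_with_history(path)
greeting = db["greeting"]
db.close()
```

A marker key is only one way to do this; checking for the file on disk before opening would be another, at the cost of depending on how the underlying dbm backend names its files.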

Now consider gitshelve. Whereas the Python shelve module stores your data by pickling all of the dictionary values, I pass whatever data you place in the dictionary straight on to Git’s standard input. In the default mode, this means you work strictly with string data:

import gitshelve

key = 'greeting'  # any string can serve as a key
data = gitshelve.open(repository = '/tmp/data.git')

data[key] = "Hello, world!"
data.sync() # Repository is created if it doesn't exist

del data[key]
data.close()

The interface is identical, but with the Git version you can now examine the resulting repository yourself, using regular Git commands:

$ GIT_DIR=/tmp/data.git git log

By default, the commits have no associated comment text, and the sync method doesn’t accept parameters. If you wish to add transaction notes, use the commit method instead:

data.commit("This is a comment")

You can store data this way either in a separate repository, or in named branches within any repository. If the repository argument is not given, the named branch within the current Git repository is used. An exception will be raised, however, if you do this and there is no Git repository related to the current directory.

# I'm expecting to use the 'data' branch of the current repository, but
# I ran the script in a directory unknown to Git!
data = gitshelve.open(branch = 'data')

# It appears to work, because no Git commands are run until the last
# possible moment
data['foo/bar/hello.txt'] = "Hello!"

# This raises an exception, because there is no current repository. To fix
# it, either run "git init", or use a specific 'repository' argument above.
data.commit("I just said hello")

The really nice thing about using Git this way is that you get all of its best features for free.

<h3 id="addednon-textvalues">Added non-text values</h3>

If you have a need to store non-textual values, you’ll have to let gitshelve know how to deal with them. I don’t do any such handling by default, because of the big chance of doing the wrong thing, and having you not find out about it until it’s much too late. Just pickling data like shelve does isn’t very smart, for example, because it will wreak havoc on Git’s merge algorithms should you ever need to incorporate new data from another source.

So, let’s see how to add a custom data translator. First, you need to subclass a new type of gitbook, which is the wrapper used to interface with the blobs in the Git repository. There are only two methods you need to override:

class my_gitbook(gitshelve.gitbook):
    def serialize_data(self, data):
        return object_to_string(data)

    def deserialize_data(self, data):
        return object_from_string(data)

Now you must define object_to_string and object_from_string, which should examine the types of the objects passed and turn them into merge-friendly strings as appropriate. Certain forms of XML work well for this job, as do ini-style configuration files in some cases. It’s up to you and what works best for your usage.
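To make the idea concrete, here is one possible pair of translators. This is only a sketch: it swaps in JSON rather than the XML or ini formats mentioned above, on the assumption that sorted keys and one entry per line keep diffs small and stable. The sample `issue` record is invented for the example.

```python
import json


def object_to_string(obj):
    """Serialize to JSON with sorted keys and one entry per line."""
    return json.dumps(obj, sort_keys=True, indent=1) + "\n"


def object_from_string(text):
    """Inverse of object_to_string."""
    return json.loads(text)


# A made-up record, just to exercise the round trip.
issue = {"title": "Crash on start", "status": "open", "tags": ["bug", "ui"]}
text = object_to_string(issue)
restored = object_from_string(text)
```

Because the keys are always written in the same order, two people changing different fields of the same record touch different lines, which gives a line-based merge a fair chance of succeeding.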

Once you have this new class type, you must pass it to the gitshelve.open function:

data = gitshelve.open(repository = '/tmp/foo', book_type = my_gitbook)
<h3 id="makingthingsevenfaster">Making things even faster</h3>

Every time you open a gitshelve, it must walk through the associated branch and determine its contents in order to build the key/value relationships in the dictionary. If you find that this ever gets slow, what you can do is just pickle the gitshelve! The only caveat is that you must take care to delete it if the HEAD you created it from is different from the current HEAD. Here’s an example:

import gitshelve
import cPickle
import os

data = None
if os.path.isfile('data.cache'):
    fd = open('data.cache', 'rb')
    data = cPickle.load(fd)
    fd.close()

    # I'm using an arbitrary file name here, __HEAD__
    if data['__HEAD__'] != data.current_head():
        data = None # Out of date, we can't use it

if not data:
    data = gitshelve.open(branch = 'data')
    data['__HEAD__'] = data.current_head()

# ... for data sets with enormous quantities of tiny files, this
# could really speed things up ...
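The same invalidation rule can be sketched without gitshelve at all, including the step of writing the cache back out. This version assumes Python 3 (where pickle replaces cPickle), uses a plain dictionary as a stand-in for the shelf, and takes the HEAD id as an argument; the function name `load_cached` and the ids are made up for illustration.

```python
import os
import pickle
import tempfile


def load_cached(cache_path, current_head, rebuild):
    """Return cached data if it was built from current_head, else rebuild it."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as fd:
            head, data = pickle.load(fd)
        if head == current_head:
            return data, True          # cache hit
    data = rebuild()                   # the slow walk happens only here
    with open(cache_path, "wb") as fd:
        pickle.dump((current_head, data), fd)
    return data, False                 # cache miss, cache rewritten


cache = os.path.join(tempfile.mkdtemp(), "data.cache")
build = lambda: {"foo/bar/hello.txt": "Hello!"}

_, hit1 = load_cached(cache, "abc123", build)   # first run, no cache yet
_, hit2 = load_cached(cache, "abc123", build)   # same HEAD, cache is reused
_, hit3 = load_cached(cache, "def456", build)   # HEAD moved, cache is stale
```

Storing the HEAD id next to the data, rather than inside it, keeps the bookkeeping out of the key space.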
<h3 id="wherecanyougetit">Where can you get it?</h3>

The gitshelve module is being maintained as part of the git-issues project, which is yet another attempt to bring distributed bug tracking to Git. Actually, I intend to support multiple repositories as data backends, but right now Git is my initial focus. You can clone the project and test it out as follows:

git clone git://github.com/jwiegley/git-issues.git
cd git-issues
python t_gitshelve.py

If you see “OK” at the end of the unit tests, you’re good to go! There isn’t much documentation on gitshelve.py itself right now, beyond this blog entry, but then again the shelve-like interface is simple enough that you really shouldn’t need much more.</body></html>

Syndicated 2008-05-15 04:00:59 from johnw@newartisans.com

9 May 2008 »

Emacs Chess now hosted at GitHub

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN"><html><head><meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org"><title></title></head><body>

Emacs Chess is a fully featured chess client written entirely in Emacs Lisp. You can use it to play against other people on freechess.org, or against popular chess engines like gnuchess and crafty. It supports graphical rendering of chess boards within Emacs (in 2D), ASCII displays, and even electronic chess boards, or producing output appropriate for braille readers. Adding a new back-end is trivial. It also comes with a library for inspecting and reasoning about chess positions.

This project is looking for someone who loves Emacs, Lisp and the game of chess, to fork it and take over as maintainer. The FSF has agreed to include Emacs Chess as part of the Emacs distribution, but I’ve held off because of a few remaining issues I want to see resolved before it goes mainstream. It does work quite well, however, and I have friends who use it as their sole client for playing chess online.

Emacs Chess is now being hosted at GitHub, which should make it easier for others to contribute:

http://github.com/jwiegley/emacs-chess

If you’d like to just clone it and try it out, run the following and then see the README:

git clone git://github.com/jwiegley/emacs-chess.git
cd emacs-chess
git submodule init
git submodule update # grab the 2D pieces and sound sets
make

After it compiles, add the emacs-chess directory to your load-path, load chess.el, and then type M-x chess!

If anyone is interested in taking over as the maintainer, or would like to contribute those last few weeks of work necessary to getting this project delivered with GNU Emacs, please contact me.</body></html>

Syndicated 2008-05-08 20:49:57 from johnw@newartisans.com

5 May 2008 »

Ready Lisp version 20080428 now available

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN"><html><head><meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org"><title></title></head><body>

There is a new version of Ready Lisp for Mac OS X available. This version is based on SBCL 1.0.16, and requires OS X Leopard 10.5. The most notable change from the previous version is that 64-bit mode and experimental threading are no longer supported, since both have been known to have issues on OS X, while the purpose of Ready Lisp is to smoothly introduce Common Lisp to new users.

What is Ready Lisp? It’s a binding together of several popular Lisp packages for OS X, including: Aquamacs, SBCL and SLIME. Once downloaded, you’ll have a single application bundle which you can double-click – and find yourself in a fully configured Common Lisp REPL. It’s ideal for OS X users who want to try out Lisp with a minimum of hassle. The download is approximately 76 megabytes.

There is a GnuPG signature for this file in the same directory; append .asc to the above filename to download it. To install my public key onto your keyring, use this command:

$ gpg --keyserver pgp.mit.edu --recv 0x824715A0

Once installed, you can verify the download using the following command:

$ gpg --verify ReadyLisp.dmg.asc

For more information, see the Ready Lisp project page.</body></html>

Syndicated 2008-05-05 06:21:58 from johnw@newartisans.com

28 Apr 2008 »

Git from the bottom up

In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom up — rather than look at it only in terms of its high-level commands. And since Git is so beautifully simple when viewed this way, I thought others might be interested to read what I’ve found, and perhaps avoid the pain I went through finding it.

The following article offers what I've learned on this journey so far. I hope it can help others to comprehend this wonderful system, and discover some of the joy I've experienced in the past few weeks.

Here is a summary from the table of contents:

  • Introduction
  • Repository: Directory content tracking
  • Introducing the blob
  • Blobs are stored in trees
  • How trees are made
  • The beauty of commits
  • A commit by any other name…
  • Branching and the power of rebase
  • Index Cache: Meet the middle man
  • Taking the index cache farther
  • To reset, or not to reset
  • Last links in the chain: Stashing and the reflog

Syndicated 2008-04-28 00:32:07 from johnw@newartisans.com

14 Apr 2008 »

Diving into Git

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><html><head><meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org"><title></title></head><body>

This week I decided to convert my Ledger repository over to Git. Previously I’d been using Subversion for about 4 years, and CVS for 1 year before that. There was a brief flirtation with Darcs and Mercurial, but neither ever attracted me enough to convert the repository officially.

Why did I choose Git? Actually, I’d looked at Git before, maybe a year ago, and decided it was too complex and funky. But some recent articles — and new versions of Git — prompted me to look again. Yes, it still looks complex, but then again, UNIX is complex and I’ve never stopped loving that since I made my first terminal connection. In fact, when you look at Git in terms of the UNIX philosophy, rather than as a single application, it starts making a whole lot more sense. (It was written by a UNIX-ish kernel developer, after all).

Migrating my official repository represented a special challenge, because I decided I wanted my entire history, not just the Subversion parts of it. I mean, I wanted to pull the CVS repo out of the archives and thread it along with the Subversion repo into a nice, coherent history going all the way back to version 0.1.

With other tools, even Mercurial, I would have shied away from such an undertaking. But Git not only made it possible, it was even straightforward and rather fun to do. This article chronicles my adventures in manually pasting together a version control history, and how powerfully Git was able to handle this task, which would have been patently impossible using CVS or Subversion.

<h2>Importing the CVS history</h2>

The first step was to import the CVS history. The Ledger project began in late August 2003, but I didn’t start using version control to track it until 24 Sep. Luckily I had an old backup image on my laptop and was able to start hacking right away. The command to pull this initial history into Git was:

$ mkdir ledger.cvs; cd ledger.cvs; git init
$ git-cvsimport -d /tmp/cvs -v -m -p -Z,9 ledger

In this case, /tmp/cvs is where I copied the CVS repository from my backup image, since CVS requires write access in order to do file and folder locking. The command ran very quickly, since the history was only 1 year long. After it completed, I was able to run git log right away and see what my initial commits looked like.

<h2>Importing the Subversion history</h2>

The next step was to import my Subversion history from the Sourceforge server. Actually, I copied it to my local disk first which made things go much quicker. I imported this into a new repo, running the command from the same parent directory as ledger.cvs above:

$ git svn clone --no-metadata --prefix=svn/ file:///tmp/svn/ledger \
--trunk=trunk --branches=branches --tags=tags

I used the --no-metadata flag because I didn’t want git-svn-id tags littering my commit comments with uselessly redundant information. Since I don’t plan on using Subversion again for this project, there was no need to retain the tracking info.

After about 30 minutes the command completed, and presented me with a repository where the trunk, and every branch and tag, existed as remote branches. When I ran git log trunk, I saw all my Subversion history.

<h2>Rebasing one history on top of another</h2>

The Subversion history was started by checking in the contents of my source tree at some particular moment in time. The question is, how do I now base the Subversion history on the CVS history, in such a way that the connection is seamless? It turns out this is incredibly easy to do with an amazingly powerful command: git-rebase.

I’ll go ahead and do this work in yet another repository, just to show how easily Git handles these kinds of things:

$ mkdir ledger.all; cd ledger.all; git init

$ git remote add cvs ../ledger.cvs/.git
$ git remote add svn ../ledger/.git

$ git fetch cvs # bring in all the CVS commits
$ git fetch svn # bring in all the Subversion commits

Now that both histories existed in one repository, I needed just one more bit of information. Namely, I needed to know the base commit of the Subversion tree (the first checkin I ever made to it). This checkin looks like a bunch of file adds, since all I did was copy in a big set of files.

$ git log svn/master | tail -10

I wrote this commit’s hash number down and kept it in a safe place. I really only needed to know the first 6 or 7 characters. Let’s assume it was bd39abb.

Next I needed to know if anything significant had changed between the last CVS commit and the first SVN commit. This would mean any changes made during the transition between version control systems. Ideally there would be none, but you never know. I went ahead and applied these changes as a patch within a new local branch, which was based on the old CVS history:

$ git checkout -b cvs-work cvs/master
$ git diff cvs/master..bd39abb | patch -p1
$ find . -type f | xargs git add
$ git commit -m "Changes between CVS and Subversion"

What this did was create a branch whose final commit is identical to the starting state of the Subversion branch. It should be painless now to “rebase” the Subversion branch so that the parent of its first commit becomes the last commit of the CVS history:

$ git checkout -b svn-work svn/master
$ git rebase cvs-work

This command took a while, since it effectively “re-committed” every single commit in the entire Subversion history. Also, since the first commit (the one where I checked the current state of my files into Subversion) is now a no-op, it just disappeared altogether from the history. The output from git log now shows my entire history from beginning to end.
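A toy model may help show why every commit gets a new id and why the import commit vanishes. This is a simplified sketch, not how Git is implemented: real commit ids also cover author and date, and rebase replays patches rather than whole trees. The function names and the sample histories are invented for the example.

```python
import hashlib


def commit_id(parent, message, tree):
    """Derive an id from parent, message and content, loosely like Git does."""
    raw = repr((parent, message, sorted(tree.items()))).encode()
    return hashlib.sha1(raw).hexdigest()[:7]


def build(base, base_tree, commits):
    """Chain (message, tree) pairs onto base; return [(id, message, tree)]."""
    out, parent, prev_tree = [], base, base_tree
    for message, tree in commits:
        if tree == prev_tree:
            continue  # no-op commit: nothing changed, so it is dropped
        cid = commit_id(parent, message, tree)
        out.append((cid, message, tree))
        parent, prev_tree = cid, tree
    return out


cvs = build(None, {}, [("cvs 1", {"a": "1"}), ("cvs 2", {"a": "2"})])
svn_commits = [("initial import", {"a": "2"}), ("svn 2", {"a": "3"})]

svn_alone = build(None, {}, svn_commits)            # history as imported
cvs_tip_id, _, cvs_tip_tree = cvs[-1]
rebased = build(cvs_tip_id, cvs_tip_tree, svn_commits)  # replayed on CVS
```

In `rebased` the "initial import" entry is gone, because its content matches the CVS tip, and "svn 2" carries a different id than it had in `svn_alone`, because its parent changed.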

I did encounter a problem here with commits that had no checkin comment. In that case, I had to supply a “no comment” string manually, and then resume the rebase operation with git rebase --continue. And if at any time I might have decided against the rebase operation, or if there were major problems, a simple git rebase --abort would have put me right back where I started.

With the svn-work branch now representing my entire history from start to finish, I decided to make it my new local master:

$ git branch -D master
$ git branch -m svn-work master
<h2>Cleaning up history</h2>

There was a time during my Subversion days when I hastily checked in over 15 megabytes worth of dependent tool chains, thinking it would be easier for my users to obtain the exact version I was using. Many commits later I decided against this, but there was no way to avoid the fact that Subversion holds onto your mistakes forever, permanently cluttering the repository with these dead files. What I wanted to know was, can I clean those turds out of my Git history, thus reducing my ridiculously large 77 MB repository (before packing, 31 MB after)?

The answer was a surprisingly easy Yes; and one made possible, again, by the glorious rebase command.

The first step was to find two different commits: the one where I added the tool chain tarballs, and the one where I removed them. This can be done fairly quickly using the log command:

$ git log --stat

I just searched for “.gz”, since I knew all the tarballs ended with it. Sure enough, they were checked in by commit 87abc32 and removed by commit 7734ff0.

To edit a repository’s history, use the rebase command with its interactive option, starting it from the parent of the first commit you want to change:

$ git rebase -i 87abc32^

This command says: starting with the parent of commit 87abc32, I want the ability to rewrite, delete, or re-order all the commits that come after it. What you should see after a bit of thinking is a file with a bunch of lines that begin with “pick”. If you were to write this file out now and exit — not making any changes — it would reapply every commit in the file starting with the first. This changes the commit ids, so you can’t do this if you have observers pulling from your repository. Do it only in local branches, or before you publish your repo, as was my case here.

What I needed was to find the line pick 7734ff0 and move it right after the first line, which was pick 87abc32. I then changed the word “pick” to “squash” in the second line, meaning that I wanted rebase to put the two commits together, resulting in a commit whose diff represented the cumulative changes of the two. Since the first commit added the files (among other things), and the second commit removed them, the final result will be a commit with no tarballs in it at all, just all the other changes that happened in 87abc32.
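The cancelling effect of the squash can be sketched with a small model in which a diff maps each path to its content before and after the commit. This is an illustration only; the `squash` function and the sample diffs are invented here and say nothing about how Git stores changes internally.

```python
def squash(first, second):
    """Combine two diffs. Each maps path -> (before, after); None means absent."""
    combined = dict(first)
    for path, (before, after) in second.items():
        if path in combined:
            before = combined[path][0]   # keep the earliest "before" state
        if before == after:
            combined.pop(path, None)     # net effect is nothing: drop the entry
        else:
            combined[path] = (before, after)
    return combined


# The first commit adds a tarball and also changes a source file.
added = {"tools.tar.gz": (None, "<tarball bytes>"), "main.c": ("v1", "v2")}
# The later commit only removes the tarball again.
removed = {"tools.tar.gz": ("<tarball bytes>", None)}

result = squash(added, removed)
```

The tarball goes from absent to absent, so it drops out of `result` entirely, while the change to `main.c` survives untouched.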

It took about a minute for this to run, but at the end I was able to look at my new log and not see any trace of a tarball anywhere.

<h2>“Bring out your dead”</h2>

The size of my .git directory, however, was still a dismaying 77 MB. I ran git prune (to remove the repository objects no longer being referenced), but the size didn’t change. What was going on? I then ran these commands:

$ git fsck --unreachable
$ git fsck --lost-found
dangling commit ....
dangling blob ....

Although the --unreachable option didn’t show anything as being available for pruning, the --lost-found option showed me the very commits I had just removed, and their associated blobs (the tarballs I was concerned about). But why was Git still holding onto them?

It turns out that Git has a very, very cool feature where it keeps track of every change you make to your repository. Say, for example, that you “pop off” the most recent commit in your branch, effectively deleting it:

$ git reset --hard HEAD^

This command removes the last commit from your repository’s history and resets your working tree to match the new HEAD. It’s like the commit never happened, and so it should be gone forever now, right? Well, the real answer is: not yet.

Git still holds a pointer to your commit in the form of a “reflog”. The reflog keeps track of every change you make to the repository, allowing you to examine and possibly recover them. For example, if you used the reflog command right after your reset command you might see something like this:

$ git reflog
bc180ef... HEAD@{0}: reset --hard HEAD^: updating HEAD

It even has a hash value, which is just like a regular commit! In fact, it is a commit, except that it’s more like a “meta commit”. That is, it’s not a commit reflecting a change you’ve made to your project’s sources, but rather a commit that represents the change you just made to the repository itself. Here are a few commands you can use to examine the reflog commit more closely:

$ git cat-file -t bc180ef    # prove to me that it's a commit
$ git ls-tree -l bc180ef # what data is it holding onto?
$ git show bc180ef # show me a patch of what I dropped

Because this commit exists in your repository’s reflog, all the blobs it references — and the file copies reflecting those changes — will continue to live on. How long? The default is 30 days. Which means that git prune and git gc will not actually delete the space taken up by that commit for another month.
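The rule at work here is reachability: an object survives pruning as long as something still points at it, and the reflog counts as something. The sketch below models that with a dictionary of parent links. It is a simplification with made-up commit names and function names, and it ignores trees, blobs and expiry times.

```python
def reachable(roots, parents):
    """Return every commit reachable from roots by following parent links."""
    seen, stack = set(), list(roots)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def prune(parents, branch_tips, reflog):
    """Keep only commits reachable from branch tips or reflog entries."""
    keep = reachable(list(branch_tips) + list(reflog), parents)
    return {c: p for c, p in parents.items() if c in keep}


# c3 was the tip until a hard reset moved master back to c2.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"]}

before_expiry = prune(parents, ["c2"], reflog=["c3", "c2"])
after_expiry = prune(parents, ["c2"], reflog=[])
```

While the reflog still mentions `c3`, pruning leaves it in place; once that entry is gone, nothing refers to `c3` and it can be collected.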

In the case of my giant tarballs I wanted to realize the space savings now. So I needed to prune the reflog itself such that no commit anywhere would reference my dead tarballs:

$ git reflog expire --expire=1.minute refs/heads/master
$ git fsck --unreachable # now I see those tarball blobs!
$ git prune # hasta la vista, baby
$ git gc # cleanup and repack the repo

These commands wiped out the reflog history for the specified branch (master in this case), cleaned up all the dead space, and squeezed out the redundant bits. That 77 MB unpacked repository became a nicely packed, 2.1 MB one.

<h2>The reality wasn’t quite so easy</h2>

Figuring all this out took me some time: about 16 straight hours, and the need to restart the whole process maybe 20 times. But once I got the hang of it, I found that Git’s various component tools make a whole lot of sense. There is real power here, waiting to be tapped by higher-level commands and interfaces. The kind of surgery I was able to perform, in real time, was far beyond anything I’d ever experienced in the realm of version control systems.

And it was fast!! I rarely ever had to wait long for a change to happen, even though I was rewriting years of change history.

After this experience, far from being put off by the learning curve, I’m completely sold now. I feel like my data is wholly under my control, not subject to arbitrary things like version numbers or branch labels, etc. Everything is just a commit to Git, and the objects linked to those commits. Chain commits together from parent to child and you have a history; if a commit has multiple children, that’s a branch, while multiple parents represent a merge. How much simpler can you get?

I’ve found that sometimes, the simpler a concept is the more complex its explanation becomes — because true simplicity allows for the greatest range of expressive forms.</body></html>

Syndicated 2008-04-14 06:40:31 from johnw@newartisans.com

35 older entries...

 

johnw certified others as follows:

  • johnw certified lerdsuwa as Journeyer
  • johnw certified Nafai77 as Apprentice
  • johnw certified rms as Master
  • johnw certified walters as Master
  • johnw certified rw as Apprentice

Others have certified johnw as follows:

  • aaronl certified johnw as Master
  • jtc certified johnw as Master
  • lerdsuwa certified johnw as Master
  • Ushakov certified johnw as Master
  • walters certified johnw as Master
  • cmm certified johnw as Master
  • rw certified johnw as Master
  • maragato certified johnw as Master
  • mishan certified johnw as Master
  • Nafai77 certified johnw as Master
  • lukeg certified johnw as Master
  • dgoel3 certified johnw as Master
  • deego certified johnw as Master
  • quarl certified johnw as Master
  • bpt certified johnw as Master
  • sachac certified johnw as Master
  • dhruva certified johnw as Master
  • bkhl certified johnw as Master


