1 Feb 2008 apenwarr   » (Master)

2008-01-31: Git is the next Unix

When I first heard about git, I was suspicious that there could be anything special about it, but after watching Linus' talk about it, I was... even more suspicious. I tried it anyway.

When I tried it, I realized something right away: what made git awesome was actually none of the things Linus had talked about, not really. Those things were more like... symptoms of the underlying awesomeness. Yes, git is fast. Yes, it is distributed. Yes, it is definitely not CVS. Those things are all great, but they miss the point.

What actually matters is that git is a totally new way to operate on data. It changes the game. git has been described as "concept-heavy", because it does so many things so differently from everything else. After some reflection, I realized that this is far truer than I could see at first. git's concepts are not only unusual, they're revolutionary.

Come on, revolutionary? It's just a version control system!

Actually it's not. Git was originally not a version control system; it was designed to be the infrastructure so that someone else could build one on top. And they did; nowadays there are more than 100 git-* commands installed along with git. It's scary and confusing and weird, but what that means is git is a platform. It's a new set of nouns and verbs that we never had before. Having new nouns and verbs means we can invent entirely new things that we previously couldn't do.

Git is a new kind of filesystem, and it's faster than any filesystem I've ever seen: git checkout is faster than cp -a. It even has fsck.

Git stores revision history, and it stores it in less space than any system I've ever seen or heard of. Often, in less space than the original objects themselves!

Git uses rsync-style hash authentication on everything, as well as a new "tree of hashes" arrangement I haven't seen before, to enforce security and reliability in amazing ways that make the idea of "guaranteed identical every time" not something to strive for, but something that's always irrevocably built in.

Git names everything using globally unique identifiers that nobody else will ever accidentally use, so that being distributed is suddenly trivial.

Git is actually the missing link that has prevented me from building the things I've wanted to build in the past.

I wanted to build a distributed filesystem, but it was too much work. Now it's basically been done... in userspace, cross-platform.

At NITI we built a file backup system using what was a pretty clever data structure to speed up file accesses. But we never got around to implementing sub-file deltas, because we couldn't figure out a structure that would do it both quickly and space-efficiently. With git, they did. To build your own backup system that's much better than ours, just store it in git instead.

On top of our backup system we made a protocol for synchronizing changes up to a remote repository. Our protocol was sort of okay; git's is much better, and it will surely improve a lot in the months ahead. (Currently git requires you to sync *everything* if you want to sync *anything*, but that's an implementation restriction, not a design or protocol restriction. See shallow clones for just the beginning of this.)

Someone else I know built a hash-indexed backup system to efficiently store incremental backups from a large number of systems on a single set of disks. Git does the same, only even better, and supports sub-file deltas too.

We made a diskless workstation platform called Expression Desktop (now very dead). Knowing disks were cheap and getting cheaper, we wanted to make it "diskful" eventually, automatically syncing itself from a central server... but able to guarantee that it matched the server's files exactly. We couldn't find a protocol to do it. git is that protocol.

I built a system on top of Nitix, called Versabox, that let you install a Linux system on top of a Nitix system without virtualization. I wanted a way to make it easy to install software into that Linux environment, then repackage the entire thing as an all-in-one installer kit, but have the archive contain both the original package and the new content; that way you could upgrade either part without touching the other. To do that I invented a new file format and tool, called versatar. It works, and we use it at my new company. But git would do it much better, and includes digital signatures too for free.

Numerous people have written diff and merge systems for wikis; TWiki even uses RCS. If they used git instead, the repository would be tiny, and you could make a personal copy of the entire wiki to take on the plane with you, then sync your changes back when you're done.

When Unix pipes were invented, suddenly it was trivially easy to do something that used to be really hard: connect the output of one program to the input of the next. Pipes were the fundamental insight that shaped the face of Unix. Programs didn't have to be monolithic.

With git, we've invented a new world where revision history, checksums, and branches don't make your filesystem slower: they make it faster. They don't make your data bigger: they make it smaller. They don't risk your data integrity; they guarantee integrity. They don't centralize your data in a big database; they distribute it peer to peer.

Much like Unix itself, git's actual software doesn't matter; it's the file format, the concepts, that change everything.

Whether they're called git or not, some amazing things will come of this.

Syndicated 2008-02-01 01:33:37 from apenwarr - Business is Programming

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!